public inbox for [email protected]  
Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
33+ messages / 5 participants

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-05-07 12:35  Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-05-07 12:35 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>

Hello, Matthias and others!

Updated WIP patch attached.

Changes are:
* Renaming; the new names feel better to me
* A more reliable approach in `GlobalVisHorizonKindForRel`: call
`RelationGetIndexList` if required, to make sure we have not missed
`rd_safeindexconcurrentlybuilding`
* An optimization to avoid any additional `RelationGetIndexList` calls when
no indexes are being built concurrently
* TOAST moved to TODO, since it looks like it is out of scope, but I am not
sure yet and need to dig deeper
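
A minimal sketch of the counter-based optimization from the list above, written
as a standalone toy rather than PostgreSQL code. It assumes C11 `<stdatomic.h>`
in place of `pg_atomic_uint32`, and every `toy_*` / `ToyRel` name is
hypothetical. It only illustrates the idea: bump a shared counter while a
"safe" concurrent build runs, and skip the index-list load whenever the counter
is zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared-memory counter in the patch. */
static atomic_uint toy_num_safe_ic_builds;

/* Called when a "safe" concurrent build starts (true) or finishes (false). */
static void
toy_update_num_safe_ic_builds(bool increment)
{
	if (increment)
		atomic_fetch_add(&toy_num_safe_ic_builds, 1);
	else
	{
		unsigned int old = atomic_fetch_sub(&toy_num_safe_ic_builds, 1);

		/* every decrement must pair with an earlier increment */
		assert(old > 0);
		(void) old;
	}
}

/* Cheap check: is any safe concurrent build running at all? */
static bool
toy_any_safe_ic_build(void)
{
	return atomic_load(&toy_num_safe_ic_builds) > 0;
}

/* Toy relation: only the fields this sketch needs. */
typedef struct ToyRel
{
	bool		index_list_valid;	/* stand-in for rd_indexvalid */
	int			index_list_loads;	/* how many times the list was built */
} ToyRel;

/*
 * Load the index list only when it is not cached AND some safe concurrent
 * build is running.  Returns true if a load happened.
 */
static bool
toy_load_index_list_if_needed(ToyRel *rel)
{
	if (rel->index_list_valid || !toy_any_safe_ic_build())
		return false;

	rel->index_list_valid = true;
	rel->index_list_loads++;
	return true;
}
```

With the counter at zero, `toy_load_index_list_if_needed` never touches the
relation; after `toy_update_num_safe_ic_builds(true)` the first call loads the
list once and later calls reuse the cached state.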

TODO:
* TOAST
* docs and comments
* make sure non-data tables are not affected
* Per-database scope of optimization
* Handle index building errors correctly in optimization code
* More tests: create index, multiple re-indexes, multiple tables

Thanks,
Michail.


Attachments:

  [text/x-patch] v2-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch (22.6K, 3-v2-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch)
  download | inline diff:
From 63677046efc9b6a1d93f9248c6d9dce14a945a42 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 7 May 2024 14:24:09 +0200
Subject: [PATCH v2] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations 
 with CONCURRENTLY" which was reverted by e28bb8851969.

The issue was caused by the absence of any snapshot that actually protects the data in the relation as required to build the index correctly.

Introduce a new type of visibility horizon to be used for relations with concurrently built indexes (in the case of a "safe" index).

Now `GlobalVisHorizonKindForRel` may dynamically decide which horizon to use based on the data about safe indexes being built concurrently.

To reduce the performance impact, a counter of concurrently built indexes is maintained in shared memory.
---
 src/backend/catalog/index.c              |  36 ++++++
 src/backend/commands/indexcmds.c         |  20 +++
 src/backend/storage/ipc/ipci.c           |   2 +
 src/backend/storage/ipc/procarray.c      |  88 ++++++++++++-
 src/backend/utils/cache/relcache.c       |  11 ++
 src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++++++++++++++++++++++
 src/include/catalog/index.h              |   5 +
 src/include/utils/rel.h                  |   1 +
 8 files changed, 311 insertions(+), 7 deletions(-)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..3caa2bab12 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -97,6 +97,11 @@ typedef struct
 	Oid			pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+typedef struct {
+	pg_atomic_uint32 numSafeConcurrentlyBuiltIndexes;
+} SafeICSharedState;
+static SafeICSharedState *SafeICStateShmem;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(Relation rel);
 static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -176,6 +181,37 @@ relationHasPrimaryKey(Relation rel)
 	return result;
 }
 
+
+void SafeICStateShmemInit(void)
+{
+	bool		found;
+
+	SafeICStateShmem = (SafeICSharedState *)
+			ShmemInitStruct("Safe Concurrently Build Indexes",
+							sizeof(SafeICSharedState),
+							&found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+		pg_atomic_init_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 0);
+	} else
+		Assert(found);
+}
+
+void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment)
+{
+	if (increment)
+		pg_atomic_fetch_add_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+	else
+		pg_atomic_fetch_sub_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+}
+
+bool IsAnySafeIndexBuildsConcurrently()
+{
+	return pg_atomic_read_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes) > 0;
+}
+
 /*
  * index_check_primary_key
  *		Apply special checks needed before creating a PRIMARY KEY index
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..663450ba20 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1636,6 +1636,8 @@ DefineIndex(Oid tableId,
 	 * hold lock on the parent table.  This might need to change later.
 	 */
 	LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
+	if (safe_index && concurrent)
+		UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 	PopActiveSnapshot();
 	CommitTransactionCommand();
@@ -1804,7 +1806,15 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	/* Commit index as valid before reducing counter of safe concurrently build indexes */
+	CommitTransactionCommand();
 
+	Assert(concurrent);
+	if (safe_index)
+		UpdateNumSafeConcurrentlyBuiltIndexes(false);
+
+	/* Start a new transaction to finish process properly */
+	StartTransactionCommand();
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
 	 */
@@ -3902,6 +3912,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					 indexRel->rd_indpred == NIL);
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
+		if (idx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 		/* This function shouldn't be called for temporary relations. */
 		if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -4345,6 +4357,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		UnlockRelationIdForSession(lockrelid, ShareUpdateExclusiveLock);
 	}
 
+	// now we may clear safe index building flags
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		if (newidx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(false);
+	}
+
 	/* Start a new transaction to finish process properly */
 	StartTransactionCommand();
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..260a634f1b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "catalog/index.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -357,6 +358,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
 	InjectionPointShmemInit();
+	SafeICStateShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a83c4220b..de3b3a5c0c 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -53,6 +53,7 @@
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/index.h"
 #include "catalog/pg_authid.h"
 #include "commands/dbcommands.h"
 #include "miscadmin.h"
@@ -236,6 +237,12 @@ typedef struct ComputeXidHorizonsResult
 	 */
 	TransactionId data_oldest_nonremovable;
 
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in normal user
+	 * defined tables with index building in progress by a process with PROC_IN_SAFE_IC.
+	 */
+	TransactionId data_safe_ic_oldest_nonremovable;
+
 	/*
 	 * Oldest xid for which deleted tuples need to be retained in this
 	 * session's temporary tables.
@@ -251,6 +258,7 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_SHARED,
 	VISHORIZON_CATALOG,
 	VISHORIZON_DATA,
+	VISHORIZON_DATA_SAFE_IC,
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
@@ -297,6 +305,7 @@ static TransactionId standbySnapshotPendingXmin;
 static GlobalVisState GlobalVisSharedRels;
 static GlobalVisState GlobalVisCatalogRels;
 static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisDataSafeIcRels;
 static GlobalVisState GlobalVisTempRels;
 
 /*
@@ -1727,9 +1736,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	bool		in_recovery = RecoveryInProgress();
 	TransactionId *other_xids = ProcGlobal->xids;
 
-	/* inferred after ProcArrayLock is released */
-	h->catalog_oldest_nonremovable = InvalidTransactionId;
-
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	h->latest_completed = TransamVariables->latestCompletedXid;
@@ -1749,7 +1755,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		h->oldest_considered_running = initial;
 		h->shared_oldest_nonremovable = initial;
+		h->catalog_oldest_nonremovable = initial;
 		h->data_oldest_nonremovable = initial;
+		h->data_safe_ic_oldest_nonremovable = initial;
 
 		/*
 		 * Only modifications made by this backend affect the horizon for
@@ -1847,11 +1855,28 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
 			in_recovery)
 		{
-			h->data_oldest_nonremovable =
-				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+			h->data_safe_ic_oldest_nonremovable =
+					TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, xmin);
+
+			if (!(statusFlags & PROC_IN_SAFE_IC))
+				h->data_oldest_nonremovable =
+					TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+
+			/* Catalog tables need to consider all backends in this db */
+			h->catalog_oldest_nonremovable =
+				TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+
 		}
 	}
 
+	/* catalog horizon should never be later than data */
+	Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
+	/* safe index building horizon should never be later than data horizon */
+	Assert(TransactionIdPrecedesOrEquals(h->data_safe_ic_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
 	/*
 	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
 	 * after lock is released.
@@ -1873,6 +1898,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
 		h->data_oldest_nonremovable =
 			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+		h->data_safe_ic_oldest_nonremovable =
+				TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, kaxmin);
+		h->catalog_oldest_nonremovable =
+			TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
 		/* temp relations cannot be accessed in recovery */
 	}
 
@@ -1880,6 +1909,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 										 h->shared_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_safe_ic_oldest_nonremovable));
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
@@ -1888,6 +1919,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
 	h->data_oldest_nonremovable =
 		TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
+	h->data_safe_ic_oldest_nonremovable =
+			TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, h->slot_xmin);
 
 	/*
 	 * The only difference between catalog / data horizons is that the slot's
@@ -1900,7 +1933,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	h->shared_oldest_nonremovable =
 		TransactionIdOlder(h->shared_oldest_nonremovable,
 						   h->slot_catalog_xmin);
-	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+	h->catalog_oldest_nonremovable =
+		TransactionIdOlder(h->catalog_oldest_nonremovable,
+						   h->slot_xmin);
 	h->catalog_oldest_nonremovable =
 		TransactionIdOlder(h->catalog_oldest_nonremovable,
 						   h->slot_catalog_xmin);
@@ -1918,6 +1953,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	h->oldest_considered_running =
 		TransactionIdOlder(h->oldest_considered_running,
 						   h->data_oldest_nonremovable);
+	h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running,
+							   h->data_safe_ic_oldest_nonremovable);
 
 	/*
 	 * shared horizons have to be at least as old as the oldest visible in
@@ -1925,6 +1963,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	 */
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_safe_ic_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->catalog_oldest_nonremovable));
 
@@ -1938,6 +1978,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 										 h->catalog_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->data_safe_ic_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 										 h->temp_oldest_nonremovable));
 	Assert(!TransactionIdIsValid(h->slot_xmin) ||
@@ -1973,7 +2015,22 @@ GlobalVisHorizonKindForRel(Relation rel)
 			 RelationIsAccessibleInLogicalDecoding(rel))
 		return VISHORIZON_CATALOG;
 	else if (!RELATION_IS_LOCAL(rel))
-		return VISHORIZON_DATA;
+	{
+		// TODO: do we need to do something special about the TOAST?
+		if (!rel->rd_indexvalid)
+		{
+			// skip loading indexes if we know there is not safe concurrent index builds in the cluster
+			if (IsAnySafeIndexBuildsConcurrently())
+			{
+				RelationGetIndexList(rel);
+				Assert(rel->rd_indexvalid);
+
+				if (rel->rd_safeindexconcurrentlybuilding)
+					return VISHORIZON_DATA_SAFE_IC;
+			}
+			return VISHORIZON_DATA;
+		}
+	}
 	else
 		return VISHORIZON_TEMP;
 }
@@ -2004,6 +2061,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
 			return horizons.catalog_oldest_nonremovable;
 		case VISHORIZON_DATA:
 			return horizons.data_oldest_nonremovable;
+		case VISHORIZON_DATA_SAFE_IC:
+			return horizons.data_safe_ic_oldest_nonremovable;
 		case VISHORIZON_TEMP:
 			return horizons.temp_oldest_nonremovable;
 	}
@@ -2454,6 +2513,9 @@ GetSnapshotData(Snapshot snapshot)
 		GlobalVisDataRels.definitely_needed =
 			FullTransactionIdNewer(def_vis_fxid_data,
 								   GlobalVisDataRels.definitely_needed);
+		GlobalVisDataSafeIcRels.definitely_needed =
+				FullTransactionIdNewer(def_vis_fxid_data,
+									   GlobalVisDataSafeIcRels.definitely_needed);
 		/* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
 		if (TransactionIdIsNormal(myxid))
 			GlobalVisTempRels.definitely_needed =
@@ -2478,6 +2540,9 @@ GetSnapshotData(Snapshot snapshot)
 		GlobalVisCatalogRels.maybe_needed =
 			FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
 								   oldestfxid);
+		GlobalVisDataSafeIcRels.maybe_needed =
+				FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+									   oldestfxid);
 		GlobalVisDataRels.maybe_needed =
 			FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
 								   oldestfxid);
@@ -4106,6 +4171,9 @@ GlobalVisTestFor(Relation rel)
 		case VISHORIZON_DATA:
 			state = &GlobalVisDataRels;
 			break;
+		case VISHORIZON_DATA_SAFE_IC:
+			state = &GlobalVisDataSafeIcRels;
+			break;
 		case VISHORIZON_TEMP:
 			state = &GlobalVisTempRels;
 			break;
@@ -4158,6 +4226,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
 	GlobalVisDataRels.maybe_needed =
 		FullXidRelativeTo(horizons->latest_completed,
 						  horizons->data_oldest_nonremovable);
+	GlobalVisDataSafeIcRels.maybe_needed =
+			FullXidRelativeTo(horizons->latest_completed,
+							  horizons->data_safe_ic_oldest_nonremovable);
 	GlobalVisTempRels.maybe_needed =
 		FullXidRelativeTo(horizons->latest_completed,
 						  horizons->temp_oldest_nonremovable);
@@ -4176,6 +4247,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
 	GlobalVisDataRels.definitely_needed =
 		FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
 							   GlobalVisDataRels.definitely_needed);
+	GlobalVisDataSafeIcRels.definitely_needed =
+			FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+								   GlobalVisDataSafeIcRels.definitely_needed);
 	GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
 
 	ComputeXidHorizonsResultLastXmin = RecentXmin;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 262c9878dd..21e8521ab8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -41,6 +41,7 @@
 #include "access/xact.h"
 #include "catalog/binary_upgrade.h"
 #include "catalog/catalog.h"
+#include "catalog/index.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/partition.h"
@@ -4769,6 +4770,7 @@ RelationGetIndexList(Relation relation)
 	Oid			pkeyIndex = InvalidOid;
 	Oid			candidateIndex = InvalidOid;
 	bool		pkdeferrable = false;
+	bool 		safeindexconcurrentlybuilding = false;
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the list. */
@@ -4809,6 +4811,14 @@ RelationGetIndexList(Relation relation)
 		/* add index's OID to result list */
 		result = lappend_oid(result, index->indexrelid);
 
+		/*
+		 * Consider index as building if it is ready but not yet valid.
+		 * Also, we must deal only with indexes which are built using the
+		 * concurrent safe mode.
+		 */
+		if (index->indisready && !index->indisvalid)
+			safeindexconcurrentlybuilding |= IsAnySafeIndexBuildsConcurrently();
+
 		/*
 		 * Non-unique or predicate indexes aren't interesting for either oid
 		 * indexes or replication identity indexes, so don't check them.
@@ -4869,6 +4879,7 @@ RelationGetIndexList(Relation relation)
 	relation->rd_indexlist = list_copy(result);
 	relation->rd_pkindex = pkeyIndex;
 	relation->rd_ispkdeferrable = pkdeferrable;
+	relation->rd_safeindexconcurrentlybuilding = safeindexconcurrentlybuilding;
 	if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferrable)
 		relation->rd_replidindex = pkeyIndex;
 	else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 0000000000..7b8afeead5
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0,c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	$node->pgbench(
+		'--no-vacuum --client=5 --transactions=25000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			'002_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			# Ensure some HOT updates happen
+			'002_pgbench_concurrent_transaction_updates' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  )
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	sleep(1);
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr);
+		while (1)
+		{
+
+			($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+			is($result, '0', 'REINDEX is correct');
+
+			($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', true, true);));
+			is($result, '0', 'bt_index_check is correct');
+ 			if ($result)
+ 			{
+				diag($stderr);
+ 			}
+
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+        $psql{stdin} .= "\\q\n";
+        $psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..cac413e5eb 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,11 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern void SafeICStateShmemInit(void);
+// TODO: bound by relation or database
+extern void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment);
+extern bool IsAnySafeIndexBuildsConcurrently(void);
+
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..e3c7899203 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,6 +152,7 @@ typedef struct RelationData
 	List	   *rd_indexlist;	/* list of OIDs of indexes on relation */
 	Oid			rd_pkindex;		/* OID of (deferrable?) primary key, if any */
 	bool		rd_ispkdeferrable;	/* is rd_pkindex a deferrable PK? */
+	bool		rd_safeindexconcurrentlybuilding; /* is safe concurrent index building in progress for relation */
 	Oid			rd_replidindex; /* OID of replica identity index, if any */
 
 	/* data managed by RelationGetStatExtList: */
-- 
2.34.1




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-05-07 20:23  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-05-07 20:23 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>

Hi again!

Made an error in `GlobalVisHorizonKindForRel` logic, and it was caught by a
new test.

Fixed version attached.
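
For readers following along, here is a toy model of one path in the v2 hunk
for `GlobalVisHorizonKindForRel` that reaches no `return`: the case where the
index list is already cached. Whether this is the error referred to above is
an assumption; the v3 change itself is not reproduced here. All `Toy*` /
`toy_*` names are hypothetical, and `TOY_HORIZON_NONE` merely models "fell off
the end of the function".

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyHorizonKind
{
	TOY_HORIZON_NONE = -1,		/* models "no return statement was reached" */
	TOY_HORIZON_DATA,
	TOY_HORIZON_DATA_SAFE_IC
} ToyHorizonKind;

typedef struct ToyRel
{
	bool		index_list_valid;	/* stand-in for rd_indexvalid */
	bool		safe_ic_building;	/* stand-in for the cached relcache flag */
	bool		catalog_has_safe_ic_build;	/* what a fresh load would find */
} ToyRel;

/* Simulates loading the index list and caching the flag. */
static void
toy_load_index_list(ToyRel *rel)
{
	rel->safe_ic_building = rel->catalog_has_safe_ic_build;
	rel->index_list_valid = true;
}

/* Same branch shape as the v2 hunk for non-local relations. */
static ToyHorizonKind
toy_kind_v2_shape(ToyRel *rel, bool any_safe_ic_build)
{
	if (!rel->index_list_valid)
	{
		if (any_safe_ic_build)
		{
			toy_load_index_list(rel);
			if (rel->safe_ic_building)
				return TOY_HORIZON_DATA_SAFE_IC;
		}
		return TOY_HORIZON_DATA;
	}
	return TOY_HORIZON_NONE;	/* v2 has no return on this path */
}

/* One possible shape that returns on every path (an assumption, not v3). */
static ToyHorizonKind
toy_kind_all_paths(ToyRel *rel, bool any_safe_ic_build)
{
	if (!rel->index_list_valid)
	{
		if (!any_safe_ic_build)
			return TOY_HORIZON_DATA;	/* fast path: nothing to look for */
		toy_load_index_list(rel);
		assert(rel->index_list_valid);
	}
	return rel->safe_ic_building ? TOY_HORIZON_DATA_SAFE_IC : TOY_HORIZON_DATA;
}
```

In this model a relation whose index list is already cached gets no answer
from the v2-shaped function, while the second shape consults the cached flag
regardless of whether the list was just loaded.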



Attachments:

  [text/x-patch] v3-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch (22.6K, 3-v3-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch)
  download | inline diff:
From 9a8ea366f6d2d144979e825c4ac0bdd2937bf7c1 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 7 May 2024 22:10:56 +0200
Subject: [PATCH v3] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations 
 with CONCURRENTLY" which was reverted by e28bb8851969.

The issue was caused by the absence of any snapshot that actually protects the data in the relation as required to build the index correctly.

Introduce a new type of visibility horizon to be used for relations with concurrently built indexes (in the case of a "safe" index).

Now `GlobalVisHorizonKindForRel` may dynamically decide which horizon to use based on the data about safe indexes being built concurrently.

To reduce the performance impact, a counter of concurrently built indexes is maintained in shared memory.
---
 src/backend/catalog/index.c              |  36 ++++++
 src/backend/commands/indexcmds.c         |  20 +++
 src/backend/storage/ipc/ipci.c           |   2 +
 src/backend/storage/ipc/procarray.c      |  85 ++++++++++++-
 src/backend/utils/cache/relcache.c       |  11 ++
 src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++++++++++++++++++++++
 src/include/catalog/index.h              |   5 +
 src/include/utils/rel.h                  |   1 +
 8 files changed, 309 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..3caa2bab12 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -97,6 +97,11 @@ typedef struct
 	Oid			pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+typedef struct {
+	pg_atomic_uint32 numSafeConcurrentlyBuiltIndexes;
+} SafeICSharedState;
+static SafeICSharedState *SafeICStateShmem;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(Relation rel);
 static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -176,6 +181,37 @@ relationHasPrimaryKey(Relation rel)
 	return result;
 }
 
+
+void SafeICStateShmemInit(void)
+{
+	bool		found;
+
+	SafeICStateShmem = (SafeICSharedState *)
+			ShmemInitStruct("Safe Concurrently Build Indexes",
+							sizeof(SafeICSharedState),
+							&found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+		pg_atomic_init_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 0);
+	} else
+		Assert(found);
+}
+
+void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment)
+{
+	if (increment)
+		pg_atomic_fetch_add_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+	else
+		pg_atomic_fetch_sub_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+}
+
+bool IsAnySafeIndexBuildsConcurrently()
+{
+	return pg_atomic_read_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes) > 0;
+}
+
 /*
  * index_check_primary_key
  *		Apply special checks needed before creating a PRIMARY KEY index
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..663450ba20 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1636,6 +1636,8 @@ DefineIndex(Oid tableId,
 	 * hold lock on the parent table.  This might need to change later.
 	 */
 	LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
+	if (safe_index && concurrent)
+		UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 	PopActiveSnapshot();
 	CommitTransactionCommand();
@@ -1804,7 +1806,15 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	/* Commit index as valid before reducing counter of safe concurrently build indexes */
+	CommitTransactionCommand();
 
+	Assert(concurrent);
+	if (safe_index)
+		UpdateNumSafeConcurrentlyBuiltIndexes(false);
+
+	/* Start a new transaction to finish process properly */
+	StartTransactionCommand();
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
 	 */
@@ -3902,6 +3912,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					 indexRel->rd_indpred == NIL);
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
+		if (idx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 		/* This function shouldn't be called for temporary relations. */
 		if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -4345,6 +4357,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		UnlockRelationIdForSession(lockrelid, ShareUpdateExclusiveLock);
 	}
 
+	// now we may clear safe index building flags
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		if (newidx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(false);
+	}
+
 	/* Start a new transaction to finish process properly */
 	StartTransactionCommand();
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..260a634f1b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "catalog/index.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -357,6 +358,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
 	InjectionPointShmemInit();
+	SafeICStateShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a83c4220b..446df34dab 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -53,6 +53,7 @@
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/index.h"
 #include "catalog/pg_authid.h"
 #include "commands/dbcommands.h"
 #include "miscadmin.h"
@@ -236,6 +237,12 @@ typedef struct ComputeXidHorizonsResult
 	 */
 	TransactionId data_oldest_nonremovable;
 
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in normal user
+	 * defined tables with index building in progress by a process with PROC_IN_SAFE_IC.
+	 */
+	TransactionId data_safe_ic_oldest_nonremovable;
+
 	/*
 	 * Oldest xid for which deleted tuples need to be retained in this
 	 * session's temporary tables.
@@ -251,6 +258,7 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_SHARED,
 	VISHORIZON_CATALOG,
 	VISHORIZON_DATA,
+	VISHORIZON_DATA_SAFE_IC,
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
@@ -297,6 +305,7 @@ static TransactionId standbySnapshotPendingXmin;
 static GlobalVisState GlobalVisSharedRels;
 static GlobalVisState GlobalVisCatalogRels;
 static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisDataSafeIcRels;
 static GlobalVisState GlobalVisTempRels;
 
 /*
@@ -1727,9 +1736,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	bool		in_recovery = RecoveryInProgress();
 	TransactionId *other_xids = ProcGlobal->xids;
 
-	/* inferred after ProcArrayLock is released */
-	h->catalog_oldest_nonremovable = InvalidTransactionId;
-
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	h->latest_completed = TransamVariables->latestCompletedXid;
@@ -1749,7 +1755,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		h->oldest_considered_running = initial;
 		h->shared_oldest_nonremovable = initial;
+		h->catalog_oldest_nonremovable = initial;
 		h->data_oldest_nonremovable = initial;
+		h->data_safe_ic_oldest_nonremovable = initial;
 
 		/*
 		 * Only modifications made by this backend affect the horizon for
@@ -1847,11 +1855,28 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
 			in_recovery)
 		{
-			h->data_oldest_nonremovable =
-				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+			h->data_safe_ic_oldest_nonremovable =
+					TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, xmin);
+
+			if (!(statusFlags & PROC_IN_SAFE_IC))
+				h->data_oldest_nonremovable =
+					TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+
+			/* Catalog tables need to consider all backends in this db */
+			h->catalog_oldest_nonremovable =
+				TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+
 		}
 	}
 
+	/* catalog horizon should never be later than data */
+	Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
+	/* safe index building horizon should never be later than data horizon */
+	Assert(TransactionIdPrecedesOrEquals(h->data_safe_ic_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
 	/*
 	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
 	 * after lock is released.
@@ -1873,6 +1898,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
 		h->data_oldest_nonremovable =
 			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+		h->data_safe_ic_oldest_nonremovable =
+				TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, kaxmin);
+		h->catalog_oldest_nonremovable =
+			TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
 		/* temp relations cannot be accessed in recovery */
 	}
 
@@ -1880,6 +1909,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 										 h->shared_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_safe_ic_oldest_nonremovable));
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
@@ -1888,6 +1919,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
 	h->data_oldest_nonremovable =
 		TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
+	h->data_safe_ic_oldest_nonremovable =
+			TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, h->slot_xmin);
 
 	/*
 	 * The only difference between catalog / data horizons is that the slot's
@@ -1900,7 +1933,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	h->shared_oldest_nonremovable =
 		TransactionIdOlder(h->shared_oldest_nonremovable,
 						   h->slot_catalog_xmin);
-	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+	h->catalog_oldest_nonremovable =
+		TransactionIdOlder(h->catalog_oldest_nonremovable,
+						   h->slot_xmin);
 	h->catalog_oldest_nonremovable =
 		TransactionIdOlder(h->catalog_oldest_nonremovable,
 						   h->slot_catalog_xmin);
@@ -1918,6 +1953,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	h->oldest_considered_running =
 		TransactionIdOlder(h->oldest_considered_running,
 						   h->data_oldest_nonremovable);
+	h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running,
+							   h->data_safe_ic_oldest_nonremovable);
 
 	/*
 	 * shared horizons have to be at least as old as the oldest visible in
@@ -1925,6 +1963,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	 */
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_safe_ic_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->catalog_oldest_nonremovable));
 
@@ -1938,6 +1978,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 										 h->catalog_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->data_safe_ic_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 										 h->temp_oldest_nonremovable));
 	Assert(!TransactionIdIsValid(h->slot_xmin) ||
@@ -1973,7 +2015,21 @@ GlobalVisHorizonKindForRel(Relation rel)
 			 RelationIsAccessibleInLogicalDecoding(rel))
 		return VISHORIZON_CATALOG;
 	else if (!RELATION_IS_LOCAL(rel))
+	{
+		// TODO: do we need to do something special about the TOAST?
+		if (!rel->rd_indexvalid)
+		{
+			// skip loading indexes if we know there are no safe concurrent index builds in the cluster
+			if (IsAnySafeIndexBuildsConcurrently())
+			{
+				RelationGetIndexList(rel);
+				Assert(rel->rd_indexvalid);
+			} else return VISHORIZON_DATA;
+		}
+		if (rel->rd_safeindexconcurrentlybuilding)
+			return VISHORIZON_DATA_SAFE_IC;
 		return VISHORIZON_DATA;
+	}
 	else
 		return VISHORIZON_TEMP;
 }
@@ -2004,6 +2060,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
 			return horizons.catalog_oldest_nonremovable;
 		case VISHORIZON_DATA:
 			return horizons.data_oldest_nonremovable;
+		case VISHORIZON_DATA_SAFE_IC:
+			return horizons.data_safe_ic_oldest_nonremovable;
 		case VISHORIZON_TEMP:
 			return horizons.temp_oldest_nonremovable;
 	}
@@ -2454,6 +2512,9 @@ GetSnapshotData(Snapshot snapshot)
 		GlobalVisDataRels.definitely_needed =
 			FullTransactionIdNewer(def_vis_fxid_data,
 								   GlobalVisDataRels.definitely_needed);
+		GlobalVisDataSafeIcRels.definitely_needed =
+				FullTransactionIdNewer(def_vis_fxid_data,
+									   GlobalVisDataSafeIcRels.definitely_needed);
 		/* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
 		if (TransactionIdIsNormal(myxid))
 			GlobalVisTempRels.definitely_needed =
@@ -2478,6 +2539,9 @@ GetSnapshotData(Snapshot snapshot)
 		GlobalVisCatalogRels.maybe_needed =
 			FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
 								   oldestfxid);
+		GlobalVisDataSafeIcRels.maybe_needed =
+				FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+									   oldestfxid);
 		GlobalVisDataRels.maybe_needed =
 			FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
 								   oldestfxid);
@@ -4106,6 +4170,9 @@ GlobalVisTestFor(Relation rel)
 		case VISHORIZON_DATA:
 			state = &GlobalVisDataRels;
 			break;
+		case VISHORIZON_DATA_SAFE_IC:
+			state = &GlobalVisDataSafeIcRels;
+			break;
 		case VISHORIZON_TEMP:
 			state = &GlobalVisTempRels;
 			break;
@@ -4158,6 +4225,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
 	GlobalVisDataRels.maybe_needed =
 		FullXidRelativeTo(horizons->latest_completed,
 						  horizons->data_oldest_nonremovable);
+	GlobalVisDataSafeIcRels.maybe_needed =
+			FullXidRelativeTo(horizons->latest_completed,
+							  horizons->data_safe_ic_oldest_nonremovable);
 	GlobalVisTempRels.maybe_needed =
 		FullXidRelativeTo(horizons->latest_completed,
 						  horizons->temp_oldest_nonremovable);
@@ -4176,6 +4246,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
 	GlobalVisDataRels.definitely_needed =
 		FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
 							   GlobalVisDataRels.definitely_needed);
+	GlobalVisDataSafeIcRels.definitely_needed =
+			FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+								   GlobalVisDataSafeIcRels.definitely_needed);
 	GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
 
 	ComputeXidHorizonsResultLastXmin = RecentXmin;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 262c9878dd..21e8521ab8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -41,6 +41,7 @@
 #include "access/xact.h"
 #include "catalog/binary_upgrade.h"
 #include "catalog/catalog.h"
+#include "catalog/index.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/partition.h"
@@ -4769,6 +4770,7 @@ RelationGetIndexList(Relation relation)
 	Oid			pkeyIndex = InvalidOid;
 	Oid			candidateIndex = InvalidOid;
 	bool		pkdeferrable = false;
+	bool 		safeindexconcurrentlybuilding = false;
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the list. */
@@ -4809,6 +4811,14 @@ RelationGetIndexList(Relation relation)
 		/* add index's OID to result list */
 		result = lappend_oid(result, index->indexrelid);
 
+		/*
+		 * Consider index as building if it is ready but not yet valid.
+		 * Also, we must deal only with indexes which are built using the
+		 * concurrent safe mode.
+		 */
+		if (index->indisready && !index->indisvalid)
+			safeindexconcurrentlybuilding |= IsAnySafeIndexBuildsConcurrently();
+
 		/*
 		 * Non-unique or predicate indexes aren't interesting for either oid
 		 * indexes or replication identity indexes, so don't check them.
@@ -4869,6 +4879,7 @@ RelationGetIndexList(Relation relation)
 	relation->rd_indexlist = list_copy(result);
 	relation->rd_pkindex = pkeyIndex;
 	relation->rd_ispkdeferrable = pkdeferrable;
+	relation->rd_safeindexconcurrentlybuilding = safeindexconcurrentlybuilding;
 	if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferrable)
 		relation->rd_replidindex = pkeyIndex;
 	else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 0000000000..7b8afeead5
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0,c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	$node->pgbench(
+		'--no-vacuum --client=5 --transactions=25000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			'002_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			# Ensure some HOT updates happen
+			'002_pgbench_concurrent_transaction_updates' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  )
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	sleep(1);
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr);
+		while (1)
+		{
+
+			($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+			is($result, '0', 'REINDEX is correct');
+
+			($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', true, true);));
+			is($result, '0', 'bt_index_parent_check is correct');
+ 			if ($result)
+ 			{
+				diag($stderr);
+ 			}
+
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+        $psql{stdin} .= "\\q\n";
+        $psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..cac413e5eb 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,11 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern void SafeICStateShmemInit(void);
+// TODO: bound by relation or database
+extern void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment);
+extern bool IsAnySafeIndexBuildsConcurrently(void);
+
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..e3c7899203 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,6 +152,7 @@ typedef struct RelationData
 	List	   *rd_indexlist;	/* list of OIDs of indexes on relation */
 	Oid			rd_pkindex;		/* OID of (deferrable?) primary key, if any */
 	bool		rd_ispkdeferrable;	/* is rd_pkindex a deferrable PK? */
+	bool		rd_safeindexconcurrentlybuilding; /* is safe concurrent index building in progress for relation */
 	Oid			rd_replidindex; /* OID of replica identity index, if any */
 
 	/* data managed by RelationGetStatExtList: */
-- 
2.34.1




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-05-09 13:00  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-05-09 13:00 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>

Hello, Matthias and others!

I realized the new horizon was applied only during the validation phase (once
the index is marked as ready).
Now it is applied whenever the index is not yet marked as valid.

Updated version attached.

--------------------------------------------------

> I think the best way for this to work would be an index method that
> exclusively stores TIDs, and of which we can quickly determine new
> tuples, too. I was thinking about something like GIN's format, but
> using (generation number, tid) instead of ([colno, colvalue], tid) as
> key data for the internal trees, and would be unlogged (because the
> data wouldn't have to survive a crash). Then we could do something
> like this for the second table scan phase:

Regarding that approach to dealing with the validation phase and resetting
the snapshot:

I was thinking about it and realized: once we go for an additional index,
we don't need the second heap scan at all!

We may do it this way:

* create the target index, not marked as indisready yet
* create a temporary unlogged index with the same parameters to store tids
(optionally with the indexed columns' data, see below), marked as indisready
(but not indisvalid)
* commit them both in a single transaction
* wait for other transactions to learn about them and honor them in HOT
constraints and new inserts (for the temporary index)
* from now on, our temporary index is filled with the tuples inserted into
the table
* start building our target index, resetting the snapshot every so often (if
it is a "safe" index)
* finish the target index building phase
* mark the target index as indisready
* now, start validation of the index:
    * take the reference snapshot
    * take a visibility snapshot of the target index, sort it (as is done
currently)
    * take a visibility snapshot of our temporary index, sort it
    * start a merging loop using two synchronized cursors over both
visibility snapshots
        * if we encounter a tid which is not present in the target visibility
snapshot
            * insert it into the target index
                * if the temporary index contains the column's data, we may
even avoid the tuple fetch
                * if the temporary index is tid-only, we fetch the tuple from
the heap, but as a plus we also skip inserting dead tuples into the
new index (I think it is the better option)
    * commit everything, release the reference snapshot
* wait for transactions older than the reference snapshot (as is done currently)
* mark target index as indisvalid, drop temporary index
* done
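
The merging loop in the validation step above could look roughly like the
sketch below. It is only an illustration, not code from the patch: it assumes
both visibility snapshots are already sorted, duplicate-free arrays of TIDs,
and `Tid`, `tid_cmp` and `collect_missing_tids` are hypothetical names, not
PostgreSQL APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap tuple identifier (block, offset). */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Order TIDs by block number, then by offset within the block. */
static int
tid_cmp(Tid a, Tid b)
{
	if (a.block != b.block)
		return (a.block < b.block) ? -1 : 1;
	if (a.offset != b.offset)
		return (a.offset < b.offset) ? -1 : 1;
	return 0;
}

/*
 * Walk two sorted TID arrays with synchronized cursors and copy into
 * "missing" every TID that is present in "temp" (the temporary index)
 * but absent from "target" (the target index).  "missing" must have
 * room for ntemp entries.  Returns the number of TIDs written.
 */
static size_t
collect_missing_tids(const Tid *target, size_t ntarget,
					 const Tid *temp, size_t ntemp,
					 Tid *missing)
{
	size_t		i = 0;			/* cursor over target */
	size_t		j = 0;			/* cursor over temp */
	size_t		nmissing = 0;

	assert(missing != NULL || ntemp == 0);

	while (j < ntemp)
	{
		/* advance the target cursor past TIDs smaller than the current one */
		while (i < ntarget && tid_cmp(target[i], temp[j]) < 0)
			i++;

		/* target exhausted or pointing at a larger TID: temp[j] is missing */
		if (i == ntarget || tid_cmp(target[i], temp[j]) != 0)
			missing[nmissing++] = temp[j];
		j++;
	}
	return nmissing;
}
```

Each TID returned this way would then be inserted into the target index,
either from the data stored in the temporary index or after a heap fetch,
which is the choice discussed in the two innermost bullets above.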


So, pros:
* just a single heap scan
* snapshot is reset periodically

Cons:
* we need to maintain the additional index during the main building phase
* one more tuplesort

If the temporary index is unlogged and cheap to maintain (just append-only
mechanics), this feels like a perfect tradeoff to me.

This approach will work perfectly with a low number of tuple inserts during
the building phase. And it looks like even in the worst case it is still
better than the current approach.

What do you think? Have I missed something?

Thanks,
Michail.


Attachments:

  [text/x-patch] v4-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch (22.6K, 3-v4-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch)
  download | inline diff:
From 4878cc22c9176e5bf2b7d3d9d8c95cc66c8ac007 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 8 May 2024 22:31:33 +0200
Subject: [PATCH v4] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations 
 with CONCURRENTLY" which was reverted by e28bb8851969.

The issue was caused by the absence of any snapshot that actually protects the data in the relation as required to build the index correctly.

Introduce a new type of visibility horizon to be used for relations with concurrently built indexes (in the case of a "safe" index).

Now `GlobalVisHorizonKindForRel` may dynamically decide which horizon to use based on the data about safe indexes being built concurrently.

To reduce the performance impact, a counter of concurrently built indexes is kept in shared memory.
---
 src/backend/catalog/index.c              |  36 ++++++
 src/backend/commands/indexcmds.c         |  20 +++
 src/backend/storage/ipc/ipci.c           |   2 +
 src/backend/storage/ipc/procarray.c      |  85 ++++++++++++-
 src/backend/utils/cache/relcache.c       |  11 ++
 src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++++++++++++++++++++++
 src/include/catalog/index.h              |   5 +
 src/include/utils/rel.h                  |   1 +
 8 files changed, 309 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..3caa2bab12 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -97,6 +97,11 @@ typedef struct
 	Oid			pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+typedef struct {
+	pg_atomic_uint32 numSafeConcurrentlyBuiltIndexes;
+} SafeICSharedState;
+static SafeICSharedState *SafeICStateShmem;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(Relation rel);
 static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -176,6 +181,37 @@ relationHasPrimaryKey(Relation rel)
 	return result;
 }
 
+
+void SafeICStateShmemInit(void)
+{
+	bool		found;
+
+	SafeICStateShmem = (SafeICSharedState *)
+			ShmemInitStruct("Safe Concurrently Build Indexes",
+							sizeof(SafeICSharedState),
+							&found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+		pg_atomic_init_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 0);
+	} else
+		Assert(found);
+}
+
+void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment)
+{
+	if (increment)
+		pg_atomic_fetch_add_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+	else
+		pg_atomic_fetch_sub_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+}
+
+bool IsAnySafeIndexBuildsConcurrently(void)
+{
+	return pg_atomic_read_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes) > 0;
+}
+
 /*
  * index_check_primary_key
  *		Apply special checks needed before creating a PRIMARY KEY index
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..663450ba20 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1636,6 +1636,8 @@ DefineIndex(Oid tableId,
 	 * hold lock on the parent table.  This might need to change later.
 	 */
 	LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
+	if (safe_index && concurrent)
+		UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 	PopActiveSnapshot();
 	CommitTransactionCommand();
@@ -1804,7 +1806,15 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	/* Commit index as valid before reducing counter of safe concurrently build indexes */
+	CommitTransactionCommand();
 
+	Assert(concurrent);
+	if (safe_index)
+		UpdateNumSafeConcurrentlyBuiltIndexes(false);
+
+	/* Start a new transaction to finish process properly */
+	StartTransactionCommand();
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
 	 */
@@ -3902,6 +3912,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					 indexRel->rd_indpred == NIL);
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
+		if (idx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 		/* This function shouldn't be called for temporary relations. */
 		if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -4345,6 +4357,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		UnlockRelationIdForSession(lockrelid, ShareUpdateExclusiveLock);
 	}
 
+	// now we may clear safe index building flags
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		if (newidx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(false);
+	}
+
 	/* Start a new transaction to finish process properly */
 	StartTransactionCommand();
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..260a634f1b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "catalog/index.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -357,6 +358,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
 	InjectionPointShmemInit();
+	SafeICStateShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 1a83c4220b..446df34dab 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -53,6 +53,7 @@
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/index.h"
 #include "catalog/pg_authid.h"
 #include "commands/dbcommands.h"
 #include "miscadmin.h"
@@ -236,6 +237,12 @@ typedef struct ComputeXidHorizonsResult
 	 */
 	TransactionId data_oldest_nonremovable;
 
+	/*
+	 * Oldest xid for which deleted tuples need to be retained in normal user
+	 * defined tables with index building in progress by process with PROC_IN_SAFE_IC.
+	 */
+	TransactionId data_safe_ic_oldest_nonremovable;
+
 	/*
 	 * Oldest xid for which deleted tuples need to be retained in this
 	 * session's temporary tables.
@@ -251,6 +258,7 @@ typedef enum GlobalVisHorizonKind
 	VISHORIZON_SHARED,
 	VISHORIZON_CATALOG,
 	VISHORIZON_DATA,
+	VISHORIZON_DATA_SAFE_IC,
 	VISHORIZON_TEMP,
 } GlobalVisHorizonKind;
 
@@ -297,6 +305,7 @@ static TransactionId standbySnapshotPendingXmin;
 static GlobalVisState GlobalVisSharedRels;
 static GlobalVisState GlobalVisCatalogRels;
 static GlobalVisState GlobalVisDataRels;
+static GlobalVisState GlobalVisDataSafeIcRels;
 static GlobalVisState GlobalVisTempRels;
 
 /*
@@ -1727,9 +1736,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	bool		in_recovery = RecoveryInProgress();
 	TransactionId *other_xids = ProcGlobal->xids;
 
-	/* inferred after ProcArrayLock is released */
-	h->catalog_oldest_nonremovable = InvalidTransactionId;
-
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	h->latest_completed = TransamVariables->latestCompletedXid;
@@ -1749,7 +1755,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		h->oldest_considered_running = initial;
 		h->shared_oldest_nonremovable = initial;
+		h->catalog_oldest_nonremovable = initial;
 		h->data_oldest_nonremovable = initial;
+		h->data_safe_ic_oldest_nonremovable = initial;
 
 		/*
 		 * Only modifications made by this backend affect the horizon for
@@ -1847,11 +1855,28 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
 			in_recovery)
 		{
-			h->data_oldest_nonremovable =
-				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+			h->data_safe_ic_oldest_nonremovable =
+					TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, xmin);
+
+			if (!(statusFlags & PROC_IN_SAFE_IC))
+				h->data_oldest_nonremovable =
+					TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+
+			/* Catalog tables need to consider all backends in this db */
+			h->catalog_oldest_nonremovable =
+				TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+
 		}
 	}
 
+	/* catalog horizon should never be later than data */
+	Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
+	/* safe index building horizon should never be later than data horizon */
+	Assert(TransactionIdPrecedesOrEquals(h->data_safe_ic_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
 	/*
 	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
 	 * after lock is released.
@@ -1873,6 +1898,10 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
 		h->data_oldest_nonremovable =
 			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+		h->data_safe_ic_oldest_nonremovable =
+				TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, kaxmin);
+		h->catalog_oldest_nonremovable =
+			TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
 		/* temp relations cannot be accessed in recovery */
 	}
 
@@ -1880,6 +1909,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 										 h->shared_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_safe_ic_oldest_nonremovable));
 
 	/*
 	 * Check whether there are replication slots requiring an older xmin.
@@ -1888,6 +1919,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionIdOlder(h->shared_oldest_nonremovable, h->slot_xmin);
 	h->data_oldest_nonremovable =
 		TransactionIdOlder(h->data_oldest_nonremovable, h->slot_xmin);
+	h->data_safe_ic_oldest_nonremovable =
+			TransactionIdOlder(h->data_safe_ic_oldest_nonremovable, h->slot_xmin);
 
 	/*
 	 * The only difference between catalog / data horizons is that the slot's
@@ -1900,7 +1933,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	h->shared_oldest_nonremovable =
 		TransactionIdOlder(h->shared_oldest_nonremovable,
 						   h->slot_catalog_xmin);
-	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
+	h->catalog_oldest_nonremovable =
+		TransactionIdOlder(h->catalog_oldest_nonremovable,
+						   h->slot_xmin);
 	h->catalog_oldest_nonremovable =
 		TransactionIdOlder(h->catalog_oldest_nonremovable,
 						   h->slot_catalog_xmin);
@@ -1918,6 +1953,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	h->oldest_considered_running =
 		TransactionIdOlder(h->oldest_considered_running,
 						   h->data_oldest_nonremovable);
+	h->oldest_considered_running =
+			TransactionIdOlder(h->oldest_considered_running,
+							   h->data_safe_ic_oldest_nonremovable);
 
 	/*
 	 * shared horizons have to be at least as old as the oldest visible in
@@ -1925,6 +1963,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	 */
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_safe_ic_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
 										 h->catalog_oldest_nonremovable));
 
@@ -1938,6 +1978,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 										 h->catalog_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 										 h->data_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->data_safe_ic_oldest_nonremovable));
 	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
 										 h->temp_oldest_nonremovable));
 	Assert(!TransactionIdIsValid(h->slot_xmin) ||
@@ -1973,7 +2015,21 @@ GlobalVisHorizonKindForRel(Relation rel)
 			 RelationIsAccessibleInLogicalDecoding(rel))
 		return VISHORIZON_CATALOG;
 	else if (!RELATION_IS_LOCAL(rel))
+	{
+		// TODO: do we need to do something special about the TOAST?
+		if (!rel->rd_indexvalid)
+		{
+			// skip loading indexes if we know there is not safe concurrent index builds in the cluster
+			if (IsAnySafeIndexBuildsConcurrently())
+			{
+				RelationGetIndexList(rel);
+				Assert(rel->rd_indexvalid);
+			} else return VISHORIZON_DATA;
+		}
+		if (rel->rd_safeindexconcurrentlybuilding)
+			return VISHORIZON_DATA_SAFE_IC;
 		return VISHORIZON_DATA;
+	}
 	else
 		return VISHORIZON_TEMP;
 }
@@ -2004,6 +2060,8 @@ GetOldestNonRemovableTransactionId(Relation rel)
 			return horizons.catalog_oldest_nonremovable;
 		case VISHORIZON_DATA:
 			return horizons.data_oldest_nonremovable;
+		case VISHORIZON_DATA_SAFE_IC:
+			return horizons.data_safe_ic_oldest_nonremovable;
 		case VISHORIZON_TEMP:
 			return horizons.temp_oldest_nonremovable;
 	}
@@ -2454,6 +2512,9 @@ GetSnapshotData(Snapshot snapshot)
 		GlobalVisDataRels.definitely_needed =
 			FullTransactionIdNewer(def_vis_fxid_data,
 								   GlobalVisDataRels.definitely_needed);
+		GlobalVisDataSafeIcRels.definitely_needed =
+				FullTransactionIdNewer(def_vis_fxid_data,
+									   GlobalVisDataSafeIcRels.definitely_needed);
 		/* See temp_oldest_nonremovable computation in ComputeXidHorizons() */
 		if (TransactionIdIsNormal(myxid))
 			GlobalVisTempRels.definitely_needed =
@@ -2478,6 +2539,9 @@ GetSnapshotData(Snapshot snapshot)
 		GlobalVisCatalogRels.maybe_needed =
 			FullTransactionIdNewer(GlobalVisCatalogRels.maybe_needed,
 								   oldestfxid);
+		GlobalVisDataSafeIcRels.maybe_needed =
+				FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+									   oldestfxid);
 		GlobalVisDataRels.maybe_needed =
 			FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
 								   oldestfxid);
@@ -4106,6 +4170,9 @@ GlobalVisTestFor(Relation rel)
 		case VISHORIZON_DATA:
 			state = &GlobalVisDataRels;
 			break;
+		case VISHORIZON_DATA_SAFE_IC:
+			state = &GlobalVisDataSafeIcRels;
+			break;
 		case VISHORIZON_TEMP:
 			state = &GlobalVisTempRels;
 			break;
@@ -4158,6 +4225,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
 	GlobalVisDataRels.maybe_needed =
 		FullXidRelativeTo(horizons->latest_completed,
 						  horizons->data_oldest_nonremovable);
+	GlobalVisDataSafeIcRels.maybe_needed =
+			FullXidRelativeTo(horizons->latest_completed,
+							  horizons->data_safe_ic_oldest_nonremovable);
 	GlobalVisTempRels.maybe_needed =
 		FullXidRelativeTo(horizons->latest_completed,
 						  horizons->temp_oldest_nonremovable);
@@ -4176,6 +4246,9 @@ GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons)
 	GlobalVisDataRels.definitely_needed =
 		FullTransactionIdNewer(GlobalVisDataRels.maybe_needed,
 							   GlobalVisDataRels.definitely_needed);
+	GlobalVisDataSafeIcRels.definitely_needed =
+			FullTransactionIdNewer(GlobalVisDataSafeIcRels.maybe_needed,
+								   GlobalVisDataSafeIcRels.definitely_needed);
 	GlobalVisTempRels.definitely_needed = GlobalVisTempRels.maybe_needed;
 
 	ComputeXidHorizonsResultLastXmin = RecentXmin;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 262c9878dd..93b7794b48 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -41,6 +41,7 @@
 #include "access/xact.h"
 #include "catalog/binary_upgrade.h"
 #include "catalog/catalog.h"
+#include "catalog/index.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/partition.h"
@@ -4769,6 +4770,7 @@ RelationGetIndexList(Relation relation)
 	Oid			pkeyIndex = InvalidOid;
 	Oid			candidateIndex = InvalidOid;
 	bool		pkdeferrable = false;
+	bool 		safeindexconcurrentlybuilding = false;
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the list. */
@@ -4809,6 +4811,14 @@ RelationGetIndexList(Relation relation)
 		/* add index's OID to result list */
 		result = lappend_oid(result, index->indexrelid);
 
+		/*
+		 * Consider index as building if it is not yet valid.
+		 * Also, we must deal only with indexes which are built using the
+		 * concurrent safe mode.
+		 */
+		if (!index->indisvalid)
+			safeindexconcurrentlybuilding |= IsAnySafeIndexBuildsConcurrently();
+
 		/*
 		 * Non-unique or predicate indexes aren't interesting for either oid
 		 * indexes or replication identity indexes, so don't check them.
@@ -4869,6 +4879,7 @@ RelationGetIndexList(Relation relation)
 	relation->rd_indexlist = list_copy(result);
 	relation->rd_pkindex = pkeyIndex;
 	relation->rd_ispkdeferrable = pkdeferrable;
+	relation->rd_safeindexconcurrentlybuilding = safeindexconcurrentlybuilding;
 	if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferrable)
 		relation->rd_replidindex = pkeyIndex;
 	else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 0000000000..7b8afeead5
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,155 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0,c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	$node->pgbench(
+		'--no-vacuum --client=5 --transactions=25000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			'002_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			# Ensure some HOT updates happen
+			'002_pgbench_concurrent_transaction_updates' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  )
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	sleep(1);
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr);
+		while (1)
+		{
+
+			($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+			is($result, '0', 'REINDEX is correct');
+
+			($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', true, true);));
+			is($result, '0', 'bt_index_check is correct');
+ 			if ($result)
+ 			{
+				diag($stderr);
+ 			}
+
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+        $psql{stdin} .= "\\q\n";
+        $psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c..cac413e5eb 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -175,6 +175,11 @@ extern void RestoreReindexState(const void *reindexstate);
 
 extern void IndexSetParentIndex(Relation partitionIdx, Oid parentOid);
 
+extern void SafeICStateShmemInit(void);
+// TODO: bound by relation or database
+extern void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment);
+extern bool IsAnySafeIndexBuildsConcurrently(void);
+
 
 /*
  * itemptr_encode - Encode ItemPointer as int64/int8
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..e3c7899203 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -152,6 +152,7 @@ typedef struct RelationData
 	List	   *rd_indexlist;	/* list of OIDs of indexes on relation */
 	Oid			rd_pkindex;		/* OID of (deferrable?) primary key, if any */
 	bool		rd_ispkdeferrable;	/* is rd_pkindex a deferrable PK? */
+	bool		rd_safeindexconcurrentlybuilding; /* is safe concurrent index building in progress for relation */
 	Oid			rd_replidindex; /* OID of replica identity index, if any */
 
 	/* data managed by RelationGetStatExtList: */
-- 
2.34.1




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-06-11 08:58  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-06-11 08:58 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Melanie Plageman <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>

Hello.

I built a POC [1] of the method described in the previous email, and it
looks promising.

It doesn't block VACUUM, and indexes are built about 30% faster (22 mins vs
15 mins). The additional index is lightweight and does not produce any WAL.

I'll continue stress testing for a while. I also need to restructure the
commits (my path was not direct) into meaningful, reviewable patches.

[1]
https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrent...



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-08-06 23:40  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Matthias van de Meent @ 2024-08-06 23:40 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Tue, 11 Jun 2024 at 10:58, Michail Nikolaev
<[email protected]> wrote:
>
> Hello.
>
> I did the POC (1) of the method described in the previous email, and it looks promising.
>
> It doesn't block the VACUUM, indexes are built about 30% faster (22 mins vs 15 mins).

That's a nice improvement.

> Additional index is lightweight and does not produce any WAL.

That doesn't seem to be what I see in the current patchset:
https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrent...

> I'll continue the more stress testing for a while. Also, I need to restructure the commits (my path was no direct) into some meaningful and reviewable patches.

While waiting for this, here are some initial comments on the github diffs:

- I notice you've added a new argument to
heapam_index_build_range_scan. I think this could just as well be
implemented by reading the indexInfo->ii_Concurrent field, as the
values should be equivalent, right?

- In heapam_index_build_range_scan, it seems like you're popping the
snapshot and registering a new one while holding a tuple from
heap_getnext(), thus while holding a page lock. I'm not so sure that's
OK, especially when catalogs are also involved (specifically for
expression indexes, where functions could potentially be updated or
dropped if we re-create the visibility snapshot)

- In heapam_index_build_range_scan, you pop the snapshot before the
returned heaptuple is processed and passed to the index-provided
callback. I think that's incorrect, as it'll change the visibility of
the returned tuple before it's passed to the index's callback. I think
the snapshot manipulation is best added at the end of the loop, if we
add it at all in that function.

- The snapshot reset interval is quite high, at 500ms. Why did you
configure it that low, and why didn't you make it configurable?

- You seem to be using WAL in the STIR index, while it doesn't seem
that relevant for the use case of auxiliary indexes that won't return
any data and are only used on the primary. It would imply that the
data is being sent to replicas and more data being written than
strictly necessary, which to me seems wasteful.

- The locking in stirinsert can probably be improved significantly if
we use things like atomic operations on STIR pages. We'd need an
exclusive lock only for page initialization, while share locks are
enough if the page's data is modified without WAL. That should improve
concurrent insert performance significantly, as it would further
reduce the length of the exclusively locked hot path.
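
A minimal sketch of the ordering concern in the third point above (swap
the snapshot only after the current tuple has been handed to the callback).
This is a toy model, not PostgreSQL code: ToySnapshot, ToyTuple, toy_scan
and toy_count_mismatches are hypothetical names, and a "snapshot" is reduced
to a generation counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are PostgreSQL APIs. */
typedef struct ToySnapshot { int generation; } ToySnapshot;
typedef struct ToyTuple { int id; int fetched_under; } ToyTuple;

typedef void (*toy_callback) (const ToyTuple *tuple,
                              const ToySnapshot *active, void *arg);

/*
 * Scan all tuples, swapping the "snapshot" every reset_every tuples.
 * The swap happens at the END of the loop body, so the callback always
 * runs under the same snapshot that was active when the tuple was fetched.
 * Returns the number of swaps performed.
 */
static int
toy_scan(ToyTuple *tuples, size_t ntuples, size_t reset_every,
         toy_callback cb, void *arg)
{
    ToySnapshot active = {0};
    int         resets = 0;

    for (size_t i = 0; i < ntuples; i++)
    {
        /* "fetch": remember which snapshot decided this tuple's visibility */
        tuples[i].fetched_under = active.generation;

        /* process the tuple while that snapshot is still the active one */
        cb(&tuples[i], &active, arg);

        /* only now is it safe to replace the snapshot */
        if (reset_every > 0 && (i + 1) % reset_every == 0)
        {
            active.generation++;
            resets++;
        }
    }
    return resets;
}

/* Callback that counts tuples processed under a different snapshot. */
static void
toy_count_mismatches(const ToyTuple *tuple, const ToySnapshot *active,
                     void *arg)
{
    int *mismatches = arg;

    if (tuple->fetched_under != active->generation)
        (*mismatches)++;
}
```

Moving the swap above the cb() call would make every tuple that follows a
swap point reach the callback under a snapshot it was not fetched with,
which is the hazard described in that review comment.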

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-08-08 13:53  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-08-08 13:53 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Matthias!

> While waiting for this, here are some initial comments on the github diffs:

Thanks for your review!
While stress testing the POC, I found some issues unrelated to the patch
that need to be fixed first.
These are [1] and [2].

>> Additional index is lightweight and does not produce any WAL.
> That doesn't seem to be what I see in the current patchset:

Persistence is passed as parameter [3] and set to RELPERSISTENCE_UNLOGGED
for auxiliary indexes [4].

> - I notice you've added a new argument to
> heapam_index_build_range_scan. I think this could just as well be
> implemented by reading the indexInfo->ii_Concurrent field, as the
> values should be equivalent, right?

Not always; currently, it is set by ResetSnapshotsAllowed [5].
We fall back to regular index build if there is a predicate or expression
in the index (which should be considered "safe" according to [6]).
However, we may remove this check later.
Additionally, there is no sense in resetting the snapshot if we already
have an xmin assigned to the backend for some reason.

> In heapam_index_build_range_scan, it seems like you're popping the
> snapshot and registering a new one while holding a tuple from
> heap_getnext(), thus while holding a page lock. I'm not so sure that's
> OK, especially when catalogs are also involved (specifically for
> expression indexes, where functions could potentially be updated or
> dropped if we re-create the visibility snapshot)

Yeah, good catch.
Initially, I implemented a different approach by extracting the catalog
xmin to a separate horizon [7]. It might be better to return to this option.

> In heapam_index_build_range_scan, you pop the snapshot before the
> returned heaptuple is processed and passed to the index-provided
> callback. I think that's incorrect, as it'll change the visibility of
> the returned tuple before it's passed to the index's callback. I think
> the snapshot manipulation is best added at the end of the loop, if we
> add it at all in that function.

Yes, this needs to be fixed as well.

> The snapshot reset interval is quite high, at 500ms. Why did you
> configure it that low, and didn't you make this configurable?

It is just a random value for testing purposes.
I don't think there is a need to make it configurable.
Getting a new snapshot is a cheap operation now, so we can do it more often
if required.
Internally, I was testing it with a 0ms interval.
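
A minimal sketch of an interval-based reset check like the one discussed
here, assuming time is available as plain milliseconds. ResetTimer and
reset_due are hypothetical names for illustration; the POC itself uses the
INSTR_TIME macros and VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper; not part of the patch. */
typedef struct ResetTimer
{
    int64_t     interval_ms;    /* minimum time between snapshot resets */
    int64_t     last_reset_ms;  /* time at which the snapshot was last taken */
} ResetTimer;

/*
 * Return true (and restart the timer) when at least interval_ms have
 * elapsed since the last reset.  With interval_ms == 0 every call
 * reports that a reset is due.
 */
static bool
reset_due(ResetTimer *timer, int64_t now_ms)
{
    if (now_ms - timer->last_reset_ms >= timer->interval_ms)
    {
        timer->last_reset_ms = now_ms;
        return true;
    }
    return false;
}
```

With a 0ms interval the check fires on every call, which corresponds to
taking a fresh snapshot at every opportunity.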

> You seem to be using WAL in the STIR index, while it doesn't seem
> that relevant for the use case of auxiliary indexes that won't return
> any data and are only used on the primary. It would imply that the
> data is being sent to replicas and more data being written than
> strictly necessary, which to me seems wasteful.

It just looks like an index with WAL, but as mentioned above, it is
unlogged in actual usage.

> The locking in stirinsert can probably be improved significantly if
> we use things like atomic operations on STIR pages. We'd need an
> exclusive lock only for page initialization, while share locks are
> enough if the page's data is modified without WAL. That should improve
> concurrent insert performance significantly, as it would further
> reduce the length of the exclusively locked hot path.

Hm, good idea. I'll check it later.

Best regards & thanks again,
Mikhail

[1]:
https://www.postgresql.org/message-id/CANtu0ohHmYXsK5bxU9Thcq1FbELLAk0S2Zap0r8AnU3OTmcCOA%40mail.gma...
[2]:
https://www.postgresql.org/message-id/CANtu0ojga8s9%2BJ89cAgLzn2e-bQgy3L0iQCKaCnTL%3Dppot%3Dqhw%40ma...
[3]:
https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrent...
[4]:
https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backe...
[5]:
https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backe...
[6]:
https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backe...
[7]:
https://github.com/postgres/postgres/commit/38b243d6cc7358a44cb1a865b919bf9633825b0c



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-09-01 21:19  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-09-01 21:19 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Matthias!

Just wanted to update you on the next steps in this work.

> In heapam_index_build_range_scan, it seems like you're popping the
> snapshot and registering a new one while holding a tuple from
> heap_getnext(), thus while holding a page lock. I'm not so sure that's
> OK, especially when catalogs are also involved (specifically for
> expression indexes, where functions could potentially be updated or
> dropped if we re-create the visibility snapshot)

I have returned to the solution with a dedicated catalog_xmin for backends
[1].
Additionally, I have added catalog_xmin to pg_stat_activity [2].

> In heapam_index_build_range_scan, you pop the snapshot before the
> returned heaptuple is processed and passed to the index-provided
> callback. I think that's incorrect, as it'll change the visibility of
> the returned tuple before it's passed to the index's callback. I think
> the snapshot manipulation is best added at the end of the loop, if we
> add it at all in that function.

Now it's fixed, and the snapshot is reset between pages [3].
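
A minimal sketch of what "reset between pages" means, as a toy model rather
than the code in [3]: ToyTid and assign_snapshots_per_page are hypothetical
names, and a snapshot is reduced to an integer id. The snapshot changes only
when the scan moves to a different block, so all tuples of one page are
handled under the same snapshot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuple identifier: (block number, offset within the block). */
typedef struct ToyTid
{
    uint32_t    block;
    uint16_t    offset;
} ToyTid;

/*
 * Walk TIDs in scan order and record, for each one, the id of the snapshot
 * it is processed under.  A new snapshot id is taken only on a block change,
 * never in the middle of a page.  Returns the number of resets.
 */
static size_t
assign_snapshots_per_page(const ToyTid *tids, size_t ntids,
                          uint32_t *snapshot_of)
{
    uint32_t    snapshot_id = 0;
    size_t      resets = 0;

    for (size_t i = 0; i < ntids; i++)
    {
        /* page boundary: no tuple from the previous page is in flight */
        if (i > 0 && tids[i].block != tids[i - 1].block)
        {
            snapshot_id++;
            resets++;
        }
        snapshot_of[i] = snapshot_id;
    }
    return resets;
}
```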

Additionally, I resolved the issue with potential duplicates in unique
indexes. It looks a bit clunky, but it works for now [4].

A single commit from [5] is also included, just to keep stress testing stable.

Full diff is available at [6].

Best regards,
Mikhail.

[1]:
https://github.com/michail-nikolaev/postgres/commit/01a47623571592c52c7a367f85b1cff9d8b593c0
[2]:
https://github.com/michail-nikolaev/postgres/commit/d3345d60bd51fe2e0e4a73806774b828f34ba7b6
[3]:
https://github.com/michail-nikolaev/postgres/commit/7d1dd4f971e8d03f38de95f82b730635ffe09aaf
[4]:
https://github.com/michail-nikolaev/postgres/commit/4ad56e14dd504d5530657069068c2bdf172e482d
[5]: https://commitfest.postgresql.org/49/5160/
[6]:
https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrent...



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-09-08 15:18  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-09-08 15:18 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Matthias!

>> - I notice you've added a new argument to
>> heapam_index_build_range_scan. I think this could just as well be
>> implemented by reading the indexInfo->ii_Concurrent field, as the
>> values should be equivalent, right?

> Not always; currently, it is set by ResetSnapshotsAllowed[5].
> We fall back to regular index build if there is a predicate or expression
> in the index (which should be considered "safe" according to [6]).
> However, we may remove this check later.
> Additionally, there is no sense in resetting the snapshot if we already
> have an xmin assigned to the backend for some reason.

I realized you were right. It's always possible to reset snapshots for
concurrent index building without any limitations related to predicates or
expressions.
Additionally, the PROC_IN_SAFE_IC flag is no longer necessary since
snapshots are rotating quickly, and it's possible to wait for them without
requiring any special exceptions for CREATE/REINDEX INDEX CONCURRENTLY.

Currently, it looks like this [1]. I've also attached a single large patch,
just in case.

I plan to restructure the patch into the following set:

* Introduce catalogXmin as a separate value to calculate the horizon for
the catalog.
* Add the STIR access method.
* Modify concurrent build/reindex to use an aux-index approach without
snapshot rotation.
* Add support for snapshot rotation for non-parallel and non-unique cases.
* Extend support for snapshot rotation in parallel index builds.
* Implement snapshot rotation support for unique indexes.

Best regards,
Mikhail

[1]:
https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrent...



Attachments:

  [text/x-patch] create_index_concurrently_with_aux_index_or_rotated_snapshots.patch (207.6K, 3-create_index_concurrently_with_aux_index_or_rotated_snapshots.patch)
  download | inline diff:
Subject: [PATCH] a lot of refactoring
Ensure the correct determination of index safety to be used with set_indexsafe_procflags during REINDEX CONCURRENTLY
Revert "Revert "backend_catalog_xmin in pg_stat_activity""
revert the revert of catalogXmin
fix resetting snapshot during heapam_index_build_range_scan (snapshot is reset between pages)
apply v3-0002-Modify-the-infer_arbiter_indexes-function-to-cons.patch for test stability
fix unique check for building unique indexes
support for unique indexes
revert ThereAreNoPriorRegisteredSnapshots changes
revert ThereAreNoPriorRegisteredSnapshots changes
do not hold xmin while inserting to the index
rename jam to stir
delete ii_Auxiliary
Revert "introduce PROC->catalogXmin"
Revert "backend_catalog_xmin in pg_stat_activity"
some fixes for jam
few tunes
backend_catalog_xmin in pg_stat_activity
disable snapshot reset for unique indexes
just access method to use as index for validation
support for parallel building with snapshot reset
resetting snapshot during heap scan in the case of serial index build
resetting snapshot during validate_index
introduce PROC->catalogXmin
create index concurrently using auxiliary index
---
Index: src/backend/access/heap/heapam_handler.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
--- a/src/backend/access/heap/heapam_handler.c	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/backend/access/heap/heapam_handler.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -41,10 +41,12 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/injection_point.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -1191,11 +1193,11 @@
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		pop_active_snapshot = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
 	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-
 	/*
 	 * sanity checks
 	 */
@@ -1213,6 +1215,8 @@
 	 * only one of those is requested.
 	 */
 	Assert(!(anyvisible && checking_uniqueness));
+	Assert(!(anyvisible && indexInfo->ii_Concurrent));
+	Assert(!indexInfo->ii_Concurrent || !HaveRegisteredOrActiveSnapshot() || scan);
 
 	/*
 	 * Need an EState for evaluation of index expressions and partial-index
@@ -1252,17 +1256,22 @@
 		if (!TransactionIdIsValid(OldestXmin))
 		{
 			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			PushActiveSnapshot(snapshot);
+			need_unregister_snapshot = pop_active_snapshot = !indexInfo->ii_Concurrent;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 indexInfo->ii_Concurrent);
 	}
 	else
 	{
@@ -1726,8 +1735,12 @@
 	table_endscan(scan);
 
 	/* we can now forget our snapshot, if set and registered by us */
+	if (pop_active_snapshot)
+		PopActiveSnapshot();
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
+	if (indexInfo->ii_Concurrent && !hscan)
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	ExecDropSingleTupleTableSlot(slot);
 
@@ -1740,245 +1753,206 @@
 	return reltuples;
 }
 
-static void
-heapam_index_validate_scan(Relation heapRelation,
-						   Relation indexRelation,
-						   IndexInfo *indexInfo,
+static TransactionId
+heapam_index_validate_scan(Relation table_rel,
+						   Relation index_rel,
+						   Relation  aux_index_rel,
+						   struct IndexInfo *index_info,
+						   struct IndexInfo *aux_index_info,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   struct ValidateIndexState *state,
+						   struct ValidateIndexState *aux_state)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	IndexFetchTableData *fetch;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
 
 	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL,
+					prev_indexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded,
+					prev_decoded,
+					fetched;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+	instr_time		snapshotTime,
+					currentTime,
+					elapsed;
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	INSTR_TIME_SET_CURRENT(snapshotTime);
+	limitXmin = snapshot->xmin;
 
 	/*
 	 * sanity checks
 	 */
-	Assert(OidIsValid(indexRelation->rd_rel->relam));
+	Assert(OidIsValid(index_rel->rd_rel->relam));
+	Assert(OidIsValid(aux_index_rel->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
-	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+
+	slot = MakeSingleTupleTableSlot(RelationGetDescr(table_rel),
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	fetch = heapam_index_fetch_begin(table_rel);
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&prev_decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+	ItemPointerSetInvalid(&fetched);
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	prev_indexcursor = &prev_decoded;
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while (!auxtuplesort_empty)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
-
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
-
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		INSTR_TIME_SET_CURRENT(currentTime);
+		elapsed = currentTime;
+		INSTR_TIME_SUBTRACT(elapsed, snapshotTime);
+		if (INSTR_TIME_GET_MILLISEC(elapsed) >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
-		}
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
-		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
 
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+			INSTR_TIME_SET_CURRENT(snapshotTime);
 		}
 
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			Datum ts_val;
+			bool ts_isnull;
+			auxtuplesort_empty = !tuplesort_getdatum(aux_state->tuplesort, true,
+													 false, &ts_val, &ts_isnull,
+													 NULL);
+			Assert(auxtuplesort_empty || !ts_isnull);
+			if (!auxtuplesort_empty)
+			{
+				itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+				auxindexcursor = &auxdecoded;
+			}
+			else
 			{
-				/*
-				 * Remember index items seen earlier on the current heap page
-				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				auxindexcursor = NULL;
 			}
+		}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
-		}
+		if (!auxtuplesort_empty)
+		{
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				Datum ts_val;
+				bool ts_isnull;
+				prev_decoded = decoded;
+				tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+
+					if (ItemPointerCompare(prev_indexcursor, indexcursor) == 0)
+					{
+						elog(DEBUG5, "skipping duplicate tid in target index snapshot: (%u,%u)",
+							 ItemPointerGetBlockNumber(indexcursor),
+							 ItemPointerGetOffsetNumber(indexcursor));
+					}
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				bool call_again = false;
+				bool all_dead = false;
+				ItemPointer tid;
+
+				fetched = *auxindexcursor;
+				tid = &fetched;
+
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
 
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
+				if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+				{
 
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
+					FormIndexDatum(index_info,
+								   slot,
+								   estate,
+								   values,
+								   isnull);
 
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
+					index_insert(index_rel,
+								 values,
+								 isnull,
+								 auxindexcursor, /* insert root tuple */
+								 table_rel,
+								 index_info->ii_Unique ?
+								 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+								 false,
+								 index_info);
 
-			state->tups_inserted += 1;
+					state->tups_inserted += 1;
+
+					elog(DEBUG5, "inserted root tid: (%u,%u), fetched tid: (%u,%u)",
+						 ItemPointerGetBlockNumber(auxindexcursor),
+						 ItemPointerGetOffsetNumber(auxindexcursor),
+						 ItemPointerGetBlockNumber(tid),
+						 ItemPointerGetOffsetNumber(tid));
+				}
+				else
+				{
+					elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+						 ItemPointerGetBlockNumber(auxindexcursor),
+						 ItemPointerGetOffsetNumber(auxindexcursor));
+				}
+			}
 		}
 	}
-
-	table_endscan(scan);
 
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
-	/* These may have been pointing to the now-gone estate */
-	indexInfo->ii_ExpressionsState = NIL;
-	indexInfo->ii_PredicateState = NULL;
+	heapam_index_fetch_end(fetch);
+
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
+
+	return limitXmin;
 }
 
 /*
Index: src/backend/catalog/index.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
--- a/src/backend/catalog/index.c	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/backend/catalog/index.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -67,6 +67,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -741,7 +742,8 @@
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -752,7 +754,6 @@
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
@@ -782,7 +783,6 @@
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -1459,13 +1459,151 @@
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
 	ReleaseSysCache(indexTuple);
 	ReleaseSysCache(classTuple);
 
+	return newIndexId;
+}
+
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Building an auxiliary index for an index with exclusion constraints
+	 * is not supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Fetch the pg_index tuple; expressions and predicates are read from it */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID,
+							indexExprs,
+							indexPreds,
+							false, /* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,	/* concurrent */
+							false); /* aux indexes are not summarizing */
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = RECORD_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
 	return newIndexId;
 }
 
@@ -1488,9 +1626,7 @@
 	int			save_nestlevel;
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
-
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Snapshot 	snapshot = InvalidSnapshot;
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1508,6 +1644,12 @@
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
+
+	/* BuildIndexInfo requires a snapshot for expressions and predicates */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
@@ -1518,11 +1660,17 @@
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
 
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	snapshot = InvalidSnapshot;
+
 	/* Now build the index */
-	index_build(heapRel, indexRelation, indexInfo, false, true);
+ 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
-	AtEOXact_GUC(false, save_nestlevel);
+ 	AtEOXact_GUC(false, save_nestlevel);
 
 	/* Restore userid and security context */
 	SetUserIdAndSecContext(save_userid, save_sec_context);
@@ -3177,7 +3325,8 @@
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true,  /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3288,34 +3437,59 @@
  * making the table append-only by setting use_fsm).  However that would
  * add yet more locking issues.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
-	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+			indexRelation,
+			auxIndexRelation;
+	IndexInfo  *indexInfo,
+				*auxIndexInfo;
+	Snapshot snapshot;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	int			main_work_mem_part = (maintenance_work_mem * 8) / 10;	/* 80% for the main index sort, the rest for the aux index */
 
 	{
 		const int	progress_index[] = {
-			PROGRESS_CREATEIDX_PHASE,
-			PROGRESS_CREATEIDX_TUPLES_DONE,
-			PROGRESS_CREATEIDX_TUPLES_TOTAL,
-			PROGRESS_SCAN_BLOCKS_DONE,
-			PROGRESS_SCAN_BLOCKS_TOTAL
+				PROGRESS_CREATEIDX_PHASE,
+				PROGRESS_CREATEIDX_TUPLES_DONE,
+				PROGRESS_CREATEIDX_TUPLES_TOTAL,
+				PROGRESS_SCAN_BLOCKS_DONE,
+				PROGRESS_SCAN_BLOCKS_TOTAL
 		};
 		const int64 progress_vals[] = {
-			PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN,
-			0, 0, 0, 0
+				PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN,
+				0, 0, 0, 0
 		};
 
 		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 	}
 
+	/*
+	 * Now take the "reference snapshot" that will be used by validate_index()
+	 * to filter candidate tuples.  Beware!  There might still be snapshots in
+	 * use that treat some transaction as in-progress that our reference
+	 * snapshot treats as committed.  If such a recently-committed transaction
+	 * deleted tuples in the table, we will not include them in the index; yet
+	 * those transactions which see the deleting one as still-in-progress will
+	 * expect such tuples to be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+
+	Assert(TransactionIdIsValid(MyProc->xmin));
+
 	/* Open and lock the parent heap relation */
 	heapRelation = table_open(heapId, ShareUpdateExclusiveLock);
 
@@ -3331,6 +3505,7 @@
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3338,9 +3513,11 @@
 	 * been built in a previous transaction.)
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	auxIndexInfo = BuildIndexInfo(auxIndexRelation);
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
+	auxIndexInfo->ii_Concurrent = true;
 
 	/*
 	 * Scan the index and gather up all the TIDs into a tuplesort object.
@@ -3353,6 +3530,10 @@
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
+
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
@@ -3360,9 +3541,27 @@
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+											   InvalidOid, false,
+											   maintenance_work_mem - main_work_mem_part,
+											   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, (void *) &auxState);
+
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3370,38 +3569,63 @@
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, (void *) &state);
 
+
+
 	/* Execute the sort */
 	{
 		const int	progress_index[] = {
-			PROGRESS_CREATEIDX_PHASE,
-			PROGRESS_SCAN_BLOCKS_DONE,
-			PROGRESS_SCAN_BLOCKS_TOTAL
+				PROGRESS_CREATEIDX_PHASE,
+				PROGRESS_SCAN_BLOCKS_DONE,
+				PROGRESS_SCAN_BLOCKS_TOTAL
 		};
 		const int64 progress_vals[] = {
-			PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT,
-			0, 0
+				PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT,
+				0, 0
 		};
 
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	/*
+	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.  But first, save the snapshot's xmin to use as
+	 * limitXmin for GetCurrentVirtualXIDs().
+	 */
+	limitXmin = snapshot->xmin;
+
+
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	snapshot = InvalidSnapshot;
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 
 	/*
 	 * Now scan the heap and "merge" it with the index
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
+	limitXmin = TransactionIdNewer(limitXmin, table_index_validate_scan(heapRelation,
 							  indexRelation,
+							  auxIndexRelation,
 							  indexInfo,
-							  snapshot,
-							  &state);
+							  auxIndexInfo,
+							  snapshot, /* may be invalid */
+							  &state,
+							  &auxState));
 
 	/* Done with tuplesort object */
 	tuplesort_end(state.tuplesort);
+	tuplesort_end(auxState.tuplesort);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
+	index_insert_cleanup(auxIndexRelation, auxIndexInfo);
 
 	elog(DEBUG2,
 		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
@@ -3414,8 +3638,13 @@
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3466,6 +3695,12 @@
 			Assert(!indexForm->indisready);
 			Assert(!indexForm->indisvalid);
 			indexForm->indisready = true;
+			break;
+		case INDEX_DROP_CLEAR_READY:
+			Assert(indexForm->indislive);
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
 			break;
 		case INDEX_CREATE_SET_VALID:
 			/* Set indisvalid during a CREATE INDEX CONCURRENTLY sequence */
Index: src/backend/catalog/toasting.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
--- a/src/backend/catalog/toasting.c	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/backend/catalog/toasting.c	(revision 6973360aaf4eb9012a60a5f2d5d46f022ac2d38c)
@@ -324,7 +324,8 @@
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
Index: src/backend/commands/indexcmds.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
--- a/src/backend/commands/indexcmds.c	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/backend/commands/indexcmds.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -69,6 +69,7 @@
 #include "utils/regproc.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /* non-export function prototypes */
@@ -112,7 +113,6 @@
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -428,8 +428,7 @@
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -449,8 +448,7 @@
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -542,7 +540,9 @@
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName;
 	char	   *accessMethodName;
+	Oid			auxIndexRelationId;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -561,7 +561,6 @@
 	bool		amissummarizing;
 	amoptions_function amoptions;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -571,10 +570,10 @@
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -808,6 +807,7 @@
 	 * Select name for index if caller didn't specify
 	 */
 	indexRelationName = stmt->idxname;
+	auxIndexRelationName = NULL;
 	if (indexRelationName == NULL)
 		indexRelationName = ChooseIndexName(RelationGetRelationName(rel),
 											namespaceId,
@@ -815,6 +815,12 @@
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -1116,10 +1122,6 @@
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1199,7 +1201,8 @@
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1595,6 +1598,28 @@
 
 		return address;
 	}
+	else
+	{
+		Oid			save_userid;
+		int			save_sec_context;
+		int			save_nestlevel;
+
+		GetUserIdAndSecContext(&save_userid, &save_sec_context);
+		SetUserIdAndSecContext(rel->rd_rel->relowner,
+							   save_sec_context | SECURITY_RESTRICTED_OPERATION);
+		save_nestlevel = NewGUCNestLevel();
+		RestrictSearchPath();
+
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+													tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+	}
 
 	/* save lockrelid and locktag for below, then close rel */
 	heaprelid = rel->rd_lockInfo.lockRelId;
@@ -1626,11 +1651,18 @@
 
 	PopActiveSnapshot();
 	CommitTransactionCommand();
-	StartTransactionCommand();
+
+	{
+		StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+		WaitForLockers(heaplocktag, ShareLock, true);
+		index_concurrently_build(tableId, auxIndexRelationId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
 
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
@@ -1685,25 +1717,15 @@
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1713,41 +1735,17 @@
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	WaitForLockers(heaplocktag, ShareLock, true);
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	CommitTransactionCommand();
 
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
+	StartTransactionCommand();
 
 	/*
 	 * Scan the index and the heap, insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1758,14 +1756,32 @@
 	 * transaction, and do our wait before any snapshot has been taken in it.
 	 */
 	CommitTransactionCommand();
+
+	{
+		StartTransactionCommand();
+		index_concurrently_set_dead(tableId, auxIndexRelationId);
+		CommitTransactionCommand();
+	}
+
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+
+	{
+		StartTransactionCommand();
+
+		/*
+		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+		 * right lock level.
+		 */
+		performDeletion(&auxAddress, DROP_RESTRICT,
+								 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
+		CommitTransactionCommand();
+	}
+
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/* We should now definitely not be advertising any xmin. */
-	Assert(MyProc->xmin == InvalidTransactionId);
+	Assert(MyProc->xmin == InvalidTransactionId && MyProc->catalogXmin == InvalidTransactionId);
 
 	/*
 	 * The index is now valid in the sense that it contains all currently
@@ -3431,9 +3447,9 @@
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3558,6 +3574,7 @@
 						oldcontext = MemoryContextSwitchTo(private_context);
 
 						idx = palloc_object(ReindexIndexInfo);
+						idx->auxIndexId = InvalidOid;
 						idx->indexId = cellOid;
 						/* other fields set later */
 
@@ -3608,6 +3625,7 @@
 							oldcontext = MemoryContextSwitchTo(private_context);
 
 							idx = palloc_object(ReindexIndexInfo);
+							idx->auxIndexId = InvalidOid;
 							idx->indexId = cellOid;
 							indexIds = lappend(indexIds, idx);
 							/* other fields set later */
@@ -3689,6 +3707,7 @@
 				 * that invalid indexes are allowed here.
 				 */
 				idx = palloc_object(ReindexIndexInfo);
+				idx->auxIndexId = InvalidOid;
 				idx->indexId = relationOid;
 				indexIds = lappend(indexIds, idx);
 				/* other fields set later */
@@ -3754,15 +3773,18 @@
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3781,9 +3803,6 @@
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (indexRel->rd_indexprs == NIL &&
-					 indexRel->rd_indpred == NIL);
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -3805,6 +3824,11 @@
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3819,11 +3843,17 @@
 													tablespaceid,
 													concurrentName);
 
+		auxIndexId = index_concurrently_create_aux(heapRel,
+													idx->indexId,
+													tablespaceid,
+													auxConcurrentName);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3831,8 +3861,8 @@
 		oldcontext = MemoryContextSwitchTo(private_context);
 
 		newidx = palloc_object(ReindexIndexInfo);
+		newidx->auxIndexId = auxIndexId;
 		newidx->indexId = newIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -3850,10 +3880,14 @@
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -3919,6 +3953,27 @@
 
 	PopActiveSnapshot();
 	CommitTransactionCommand();
+
+	{
+		StartTransactionCommand();
+		WaitForLockersMultiple(lockTags, ShareLock, true);
+		CommitTransactionCommand();
+	}
+
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Build the auxiliary index; this is fast because it starts empty and needs no heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
 	StartTransactionCommand();
 
 	/*
@@ -3955,13 +4010,6 @@
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -3976,7 +4024,6 @@
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
@@ -3999,12 +4046,21 @@
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
+
+	StartTransactionCommand();
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		CHECK_FOR_INTERRUPTS();
+
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+	}
+	CommitTransactionCommand();
 
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4015,17 +4071,6 @@
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4037,16 +4082,9 @@
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
 
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4085,13 +4123,6 @@
 
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4171,6 +4202,16 @@
 		index_concurrently_set_dead(oldidx->tableId, oldidx->indexId);
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		CHECK_FOR_INTERRUPTS();
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+	}
+
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4204,6 +4245,18 @@
 			object.classId = RelationRelationId;
 			object.objectId = idx->indexId;
 			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
 		}
@@ -4424,37 +4477,3 @@
 	heap_freetuple(tup);
 	table_close(classRel, RowExclusiveLock);
 }
-
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
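Reviewer note on the indexcmds.c changes above: the order of state changes applied to the auxiliary index can be followed with a small standalone model. This is only a sketch, not code from the patch. The Toy* names are hypothetical, and the meaning given to INDEX_DROP_CLEAR_READY (clear indisready, keep indislive) is an assumption, since its implementation is not part of this section.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the pg_index state columns; not the real catalog struct. */
typedef struct ToyIndexState
{
	bool		indislive;
	bool		indisready;
	bool		indisvalid;
} ToyIndexState;

/* Hypothetical mirror of the IndexStateFlagsAction values used for the aux index. */
typedef enum ToyAction
{
	TOY_CREATE_SET_READY,
	TOY_DROP_CLEAR_READY,		/* models INDEX_DROP_CLEAR_READY (assumed semantics) */
	TOY_DROP_SET_DEAD
} ToyAction;

static void
toy_set_state_flags(ToyIndexState *idx, ToyAction action)
{
	switch (action)
	{
		case TOY_CREATE_SET_READY:
			/* a freshly created index starts accepting inserts */
			assert(idx->indislive && !idx->indisready && !idx->indisvalid);
			idx->indisready = true;
			break;
		case TOY_DROP_CLEAR_READY:
			/* assumption: stop new inserts, but keep the index live */
			assert(idx->indislive && idx->indisready && !idx->indisvalid);
			idx->indisready = false;
			break;
		case TOY_DROP_SET_DEAD:
			/* the index must never have been valid at this point */
			assert(!idx->indisvalid);
			idx->indisready = false;
			idx->indislive = false;
			break;
	}
}

/* Walk the auxiliary index through the order of calls visible in the diff. */
static ToyIndexState
toy_aux_lifecycle(void)
{
	ToyIndexState aux = {true, false, false};	/* just created */

	toy_set_state_flags(&aux, TOY_CREATE_SET_READY);	/* index_concurrently_build() */
	toy_set_state_flags(&aux, TOY_DROP_CLEAR_READY);	/* before validate_index() */
	toy_set_state_flags(&aux, TOY_DROP_SET_DEAD);		/* index_concurrently_set_dead() */
	return aux;
}
```

In this model the auxiliary index never becomes valid. It only collects entries between the "set ready" and "clear ready" steps, and it is dead before performDeletion() removes it.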
Index: src/bin/pg_amcheck/t/006_concurrently.pl
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
--- /dev/null	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -0,0 +1,307 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	$node->pgbench(
+		'--no-vacuum --client=10 --transactions=10000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			'001_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			'002_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			# Ensure some HOT updates happen
+			'003_pgbench_concurrent_transaction_updates' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  )
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	my $pg_bench_fork_flag;
+	while (1) {
+		shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);	# built-in sleep() would truncate 0.1 to 0 and busy-loop
+		last if $pg_bench_fork_flag eq "stop";
+	}
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr, $n, $stderr_saved);
+		$n = 0;
+
+		$node->psql('postgres', q(CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+                                  LANGUAGE plpgsql AS $$
+                                  BEGIN
+                                    EXECUTE 'SELECT txid_current()';
+                                    RETURN true;
+                                  END; $$;));
+
+		$node->psql('postgres', q(CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+                                  LANGUAGE plpgsql AS $$
+                                  BEGIN
+                                    RETURN MOD($1, 2) = 0;
+                                  END; $$;));
+		while (1)
+		{
+
+			if (int(rand(2)) == 0) {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=1);));
+			} else {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+			}
+			is($result, '0', 'ALTER TABLE is correct');
+
+			if (1)
+			{
+				($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+				is($result, '0', 'REINDEX is correct');
+
+				if ($result) {
+					diag($stderr);
+					BAIL_OUT($stderr);
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_parent_check is correct');
+				if ($result)
+				{
+					diag($stderr);
+					BAIL_OUT($stderr);
+				} else {
+					diag('reindex:)' . $n++);
+				}
+			}
+
+			if (1)
+			{
+				my $variant = int(rand(7));
+				my $sql;
+				if ($variant == 0) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at););
+				} elsif ($variant == 1) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable(););
+				} elsif ($variant == 2) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;);
+				} elsif ($variant == 3) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i););
+				} elsif ($variant == 4) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i)););
+				} elsif ($variant == 5) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i););
+				} elsif ($variant == 6) {
+					$sql = q(CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+				} else { diag("wrong variant"); }
+
+				diag($sql);
+				($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+				is($result, '0', 'CREATE INDEX is correct');
+				$stderr_saved = $stderr;
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_parent_check for new index is correct');
+				if ($result)
+				{
+					diag($stderr);
+					diag($stderr_saved);
+					BAIL_OUT($stderr);
+				} else {
+					diag('create:)' . $n++);
+				}
+
+				if (1)
+				{
+					($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+					is($result, '0', 'REINDEX 2 is correct');
+					if ($result) {
+						diag($stderr);
+						BAIL_OUT($stderr);
+					}
+
+					($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+					is($result, '0', 'bt_index_parent_check 2 is correct');
+					if ($result)
+					{
+						diag($stderr);
+						BAIL_OUT($stderr);
+					} else {
+						diag('reindex2:)' . $n++);
+					}
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+				is($result, '0', 'DROP INDEX is correct');
+			}
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+		$psql{stdin} .= "\\q\n";
+		$psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+
+	shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# For each query we run, we'll restart the timeout.  Otherwise the timeout
+	# would apply to the whole test script, and would need to be set very high
+	# to survive when running under Valgrind.
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			diag("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
Index: src/include/access/tableam.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
--- a/src/include/access/tableam.h	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/include/access/tableam.h	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -70,6 +71,7 @@
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -703,11 +705,14 @@
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 										Relation index_rel,
+										Relation aux_index_rel,
 										struct IndexInfo *index_info,
+										struct IndexInfo *aux_index_info,
 										Snapshot snapshot,
-										struct ValidateIndexState *state);
+										struct ValidateIndexState *state,
+										struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -931,7 +936,8 @@
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -939,6 +945,11 @@
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		flags |= (SO_RESET_SNAPSHOT | SO_TEMP_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1835,19 +1846,26 @@
  *
  * See validate_index() for an explanation.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
-						  Relation index_rel,
-						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+								   Relation index_rel,
+								   Relation aux_index_rel,
+								   struct IndexInfo *index_info,
+								   struct IndexInfo *aux_index_info,
+								   Snapshot snapshot,
+								   struct ValidateIndexState *state,
+								   struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+														index_rel,
+														aux_index_rel,
+														index_info,
+														aux_index_info,
+														snapshot,
+														state,
+														auxstate);
 }
+
 
 
 /* ----------------------------------------------------------------------------
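Reviewer note on the tableam.h changes above: the new reset_snapshot argument of table_beginscan_strat() only affects how the scan flag word is composed. The sketch below reproduces that composition outside PostgreSQL. The TOY_* names are hypothetical. The bit positions 1 << 10 and 1 << 11 are taken from the diff, while the other values are illustrative assumptions and should be checked against the real ScanOptions enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 1 << 10 and 1 << 11 come from the diff; the remaining values are illustrative. */
enum
{
	TOY_SO_TYPE_SEQSCAN = 1 << 0,
	TOY_SO_ALLOW_STRAT = 1 << 6,
	TOY_SO_ALLOW_SYNC = 1 << 7,
	TOY_SO_ALLOW_PAGEMODE = 1 << 8,
	TOY_SO_TEMP_SNAPSHOT = 1 << 9,
	TOY_SO_NEED_TUPLES = 1 << 10,
	TOY_SO_RESET_SNAPSHOT = 1 << 11
};

/* Compose scan flags the way the patched table_beginscan_strat() does. */
static uint32_t
toy_beginscan_strat_flags(bool allow_strat, bool allow_sync, bool reset_snapshot)
{
	uint32_t	flags = TOY_SO_TYPE_SEQSCAN | TOY_SO_ALLOW_PAGEMODE;

	if (allow_strat)
		flags |= TOY_SO_ALLOW_STRAT;
	if (allow_sync)
		flags |= TOY_SO_ALLOW_SYNC;
	if (reset_snapshot)
		flags |= (TOY_SO_RESET_SNAPSHOT | TOY_SO_TEMP_SNAPSHOT);	/* both bits are set together */

	return flags;
}
```

Because the function gains a parameter, every existing caller of table_beginscan_strat() has to pass an explicit value for reset_snapshot. Passing false keeps the previous flag word unchanged.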
Index: src/include/catalog/index.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
--- a/src/include/catalog/index.h	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/include/catalog/index.h	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -26,6 +26,7 @@
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
 	INDEX_DROP_CLEAR_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
 
@@ -43,6 +44,8 @@
 #define REINDEXOPT_MISSING_OK 	0x04	/* skip missing relations */
 #define REINDEXOPT_CONCURRENTLY	0x08	/* concurrent mode */
 
+#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL	50	/* 50 ms */
+
 /* state info for validate_index bulkdelete callback */
 typedef struct ValidateIndexState
 {
@@ -86,7 +89,8 @@
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -98,6 +102,11 @@
 										   Oid oldIndexId,
 										   Oid tablespaceOid,
 										   const char *newName);
+
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+											 Oid mainIndexId,
+											 Oid tablespaceOid,
+											 const char *newName);
 
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
@@ -144,7 +153,7 @@
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
Index: src/include/commands/progress.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
--- a/src/include/commands/progress.h	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/include/commands/progress.h	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
@@ -79,6 +79,7 @@
 
 /* Progress parameters for CREATE INDEX */
 /* 3, 4 and 5 reserved for "waitfor" metrics */
+/* TODO: new phase names */
 #define PROGRESS_CREATEIDX_COMMAND				0
 #define PROGRESS_CREATEIDX_INDEX_OID			6
 #define PROGRESS_CREATEIDX_ACCESS_METHOD_OID	8
@@ -91,6 +92,7 @@
 /* 15 and 16 reserved for "block number" metrics */
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
+/* TODO: new phase names */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
 #define PROGRESS_CREATEIDX_PHASE_BUILD			2
 #define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
Index: src/test/regress/expected/create_index.out
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
--- a/src/test/regress/expected/create_index.out	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/test/regress/expected/create_index.out	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -1405,6 +1405,7 @@
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -2705,6 +2706,7 @@
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -2717,8 +2719,10 @@
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1 record_ops) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
Index: src/test/regress/expected/indexing.out
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
--- a/src/test/regress/expected/indexing.out	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/test/regress/expected/indexing.out	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
@@ -1571,10 +1571,11 @@
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
Index: src/test/regress/sql/create_index.sql
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
--- a/src/test/regress/sql/create_index.sql	(revision 2b5f57977f6d16796121d796835c48e4241b4da1)
+++ b/src/test/regress/sql/create_index.sql	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
@@ -493,6 +493,7 @@
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1147,10 +1148,12 @@
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
Index: src/backend/access/transam/twophase.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
--- a/src/backend/access/transam/twophase.c	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/access/transam/twophase.c	(revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -459,7 +459,7 @@
 		proc->vxid.procNumber = INVALID_PROC_NUMBER;
 	}
 	proc->xid = xid;
-	Assert(proc->xmin == InvalidTransactionId);
+	Assert(proc->xmin == InvalidTransactionId && proc->catalogXmin == InvalidTransactionId);
 	proc->delayChkptFlags = 0;
 	proc->statusFlags = 0;
 	proc->pid = 0;
Index: src/backend/replication/logical/reorderbuffer.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
--- a/src/backend/replication/logical/reorderbuffer.c	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/replication/logical/reorderbuffer.c	(revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -1844,6 +1844,7 @@
 	snap->active_count = 1;		/* mark as active so nobody frees it */
 	snap->regd_count = 0;
 	snap->xip = (TransactionId *) (snap + 1);
+	snap->catalog = orig_snap->catalog;
 
 	memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
 
Index: src/backend/replication/logical/snapbuild.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
--- a/src/backend/replication/logical/snapbuild.c	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/replication/logical/snapbuild.c	(revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -564,6 +564,7 @@
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->snapXactCompletionCount = 0;
+	snapshot->catalog = false;	/* TODO: or true? */
 
 	return snapshot;
 }
@@ -600,8 +601,8 @@
 		elog(ERROR, "cannot build an initial slot snapshot, not all transactions are monitored anymore");
 
 	/* so we don't overwrite the existing value */
-	if (TransactionIdIsValid(MyProc->xmin))
-		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
+	if (TransactionIdIsValid(MyProc->xmin) || TransactionIdIsValid(MyProc->catalogXmin))
+		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin or MyProc->catalogXmin already is valid");
 
 	snap = SnapBuildBuildSnapshot(builder);
 
@@ -622,7 +623,7 @@
 		elog(ERROR, "cannot build an initial slot snapshot as oldest safe xid %u follows snapshot's xmin %u",
 			 safeXid, snap->xmin);
 
-	MyProc->xmin = snap->xmin;
+	MyProc->xmin = MyProc->catalogXmin = snap->xmin;
 
 	/* allocate in transaction context */
 	newxip = (TransactionId *)
Index: src/backend/replication/walsender.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
--- a/src/backend/replication/walsender.c	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/replication/walsender.c	(revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -305,7 +305,7 @@
 	 */
 	if (MyDatabaseId == InvalidOid)
 	{
-		Assert(MyProc->xmin == InvalidTransactionId);
+		Assert(MyProc->xmin == InvalidTransactionId && MyProc->catalogXmin == InvalidTransactionId);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyProc->statusFlags |= PROC_AFFECTS_ALL_HORIZONS;
 		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
@@ -2498,7 +2498,7 @@
 	ReplicationSlot *slot = MyReplicationSlot;
 
 	SpinLockAcquire(&slot->mutex);
-	MyProc->xmin = InvalidTransactionId;
+	MyProc->xmin = MyProc->catalogXmin = InvalidTransactionId;
 
 	/*
 	 * For physical replication we don't need the interlock provided by xmin
@@ -2627,7 +2627,7 @@
 	if (!TransactionIdIsNormal(feedbackXmin)
 		&& !TransactionIdIsNormal(feedbackCatalogXmin))
 	{
-		MyProc->xmin = InvalidTransactionId;
+		MyProc->xmin = MyProc->catalogXmin = InvalidTransactionId;
 		if (MyReplicationSlot != NULL)
 			PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 		return;
@@ -2680,11 +2680,8 @@
 		PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
 	else
 	{
-		if (TransactionIdIsNormal(feedbackCatalogXmin)
-			&& TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
-			MyProc->xmin = feedbackCatalogXmin;
-		else
-			MyProc->xmin = feedbackXmin;
+		MyProc->catalogXmin = feedbackCatalogXmin;
+		MyProc->xmin = feedbackXmin;
 	}
 }
 
Index: src/backend/storage/ipc/procarray.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
--- a/src/backend/storage/ipc/procarray.c	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/storage/ipc/procarray.c	(revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -701,7 +701,7 @@
 		Assert(!proc->subxidStatus.overflowed);
 
 		proc->vxid.lxid = InvalidLocalTransactionId;
-		proc->xmin = InvalidTransactionId;
+		proc->xmin = proc->catalogXmin = InvalidTransactionId;
 
 		/* be sure this is cleared in abort */
 		proc->delayChkptFlags = 0;
@@ -743,7 +743,7 @@
 	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
 	proc->xid = InvalidTransactionId;
 	proc->vxid.lxid = InvalidLocalTransactionId;
-	proc->xmin = InvalidTransactionId;
+	proc->xmin = proc->catalogXmin = InvalidTransactionId;
 
 	/* be sure this is cleared in abort */
 	proc->delayChkptFlags = 0;
@@ -930,7 +930,7 @@
 	proc->xid = InvalidTransactionId;
 
 	proc->vxid.lxid = InvalidLocalTransactionId;
-	proc->xmin = InvalidTransactionId;
+	proc->xmin = proc->catalogXmin = InvalidTransactionId;
 	proc->recoveryConflictPending = false;
 
 	Assert(!(proc->statusFlags & PROC_VACUUM_STATE_MASK));
@@ -1739,8 +1739,6 @@
 	bool		in_recovery = RecoveryInProgress();
 	TransactionId *other_xids = ProcGlobal->xids;
 
-	/* inferred after ProcArrayLock is released */
-	h->catalog_oldest_nonremovable = InvalidTransactionId;
 
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
@@ -1761,6 +1759,7 @@
 
 		h->oldest_considered_running = initial;
 		h->shared_oldest_nonremovable = initial;
+		h->catalog_oldest_nonremovable = initial;
 		h->data_oldest_nonremovable = initial;
 
 		/*
@@ -1796,10 +1795,13 @@
 		int8		statusFlags = ProcGlobal->statusFlags[index];
 		TransactionId xid;
 		TransactionId xmin;
+		TransactionId catalogXmin;
+		TransactionId olderXmin;
 
 		/* Fetch xid just once - see GetNewTransactionId */
 		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 		xmin = UINT32_ACCESS_ONCE(proc->xmin);
+		catalogXmin = UINT32_ACCESS_ONCE(proc->catalogXmin);
 
 		/*
 		 * Consider both the transaction's Xmin, and its Xid.
@@ -1809,11 +1811,14 @@
 		 * some not-yet-set Xmin.
 		 */
 		xmin = TransactionIdOlder(xmin, xid);
+		catalogXmin = TransactionIdOlder(catalogXmin, xid);
 
 		/* if neither is set, this proc doesn't influence the horizon */
-		if (!TransactionIdIsValid(xmin))
+		if (!TransactionIdIsValid(xmin) && !TransactionIdIsValid(catalogXmin))
 			continue;
 
+		olderXmin = TransactionIdOlder(xmin, catalogXmin);
+
 		/*
 		 * Don't ignore any procs when determining which transactions might be
 		 * considered running.  While slots should ensure logical decoding
@@ -1821,7 +1826,7 @@
 		 * include them here as well..
 		 */
 		h->oldest_considered_running =
-			TransactionIdOlder(h->oldest_considered_running, xmin);
+			TransactionIdOlder(h->oldest_considered_running, olderXmin);
 
 		/*
 		 * Skip over backends either vacuuming (which is ok with rows being
@@ -1833,7 +1838,7 @@
 
 		/* shared tables need to take backends in all databases into account */
 		h->shared_oldest_nonremovable =
-			TransactionIdOlder(h->shared_oldest_nonremovable, xmin);
+			TransactionIdOlder(h->shared_oldest_nonremovable, olderXmin);
 
 		/*
 		 * Normally sessions in other databases are ignored for anything but
@@ -1859,8 +1864,12 @@
 			(statusFlags & PROC_AFFECTS_ALL_HORIZONS) ||
 			in_recovery)
 		{
-			h->data_oldest_nonremovable =
-				TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+			if (TransactionIdIsValid(xmin))
+				h->data_oldest_nonremovable =
+					TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+			if (TransactionIdIsValid(olderXmin))
+				h->catalog_oldest_nonremovable =
+					TransactionIdOlder(h->catalog_oldest_nonremovable, olderXmin);
 		}
 	}
 
@@ -1885,6 +1894,8 @@
 			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
 		h->data_oldest_nonremovable =
 			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+		h->catalog_oldest_nonremovable =
+			TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
 		/* temp relations cannot be accessed in recovery */
 	}
 
@@ -1912,7 +1923,6 @@
 	h->shared_oldest_nonremovable =
 		TransactionIdOlder(h->shared_oldest_nonremovable,
 						   h->slot_catalog_xmin);
-	h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;
 	h->catalog_oldest_nonremovable =
 		TransactionIdOlder(h->catalog_oldest_nonremovable,
 						   h->slot_catalog_xmin);
@@ -2092,7 +2102,7 @@
  * least in the case we already hold a snapshot), but that's for another day.
  */
 static bool
-GetSnapshotDataReuse(Snapshot snapshot)
+GetSnapshotDataReuse(Snapshot snapshot, bool catalog)
 {
 	uint64		curXactCompletionCount;
 
@@ -2101,6 +2111,9 @@
 	if (unlikely(snapshot->snapXactCompletionCount == 0))
 		return false;
 
+	if (unlikely(snapshot->catalog != catalog))
+		return false;
+
 	curXactCompletionCount = TransamVariables->xactCompletionCount;
 	if (curXactCompletionCount != snapshot->snapXactCompletionCount)
 		return false;
@@ -2125,8 +2138,19 @@
 	 * requirement that concurrent GetSnapshotData() calls yield the same
 	 * xmin.
 	 */
-	if (!TransactionIdIsValid(MyProc->xmin))
-		MyProc->xmin = TransactionXmin = snapshot->xmin;
+	if (!catalog)
+	{
+		if (!TransactionIdIsValid(MyProc->xmin))
+			MyProc->xmin = snapshot->xmin;
+	}
+	else
+	{
+		if (!TransactionIdIsValid(MyProc->catalogXmin))
+			MyProc->catalogXmin = snapshot->xmin;
+	}
+
+	if (!TransactionIdIsValid(TransactionXmin))
+		TransactionXmin = snapshot->xmin;
 
 	RecentXmin = snapshot->xmin;
 	Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
@@ -2173,8 +2197,8 @@
  * Note: this function should probably not be called with an argument that's
  * not statically allocated (see xip allocation below).
  */
-Snapshot
-GetSnapshotData(Snapshot snapshot)
+static Snapshot
+GetSnapshotDataImpl(Snapshot snapshot, bool catalog)
 {
 	ProcArrayStruct *arrayP = procArray;
 	TransactionId *other_xids = ProcGlobal->xids;
@@ -2232,7 +2256,7 @@
 	 */
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
-	if (GetSnapshotDataReuse(snapshot))
+	if (GetSnapshotDataReuse(snapshot, catalog))
 	{
 		LWLockRelease(ProcArrayLock);
 		return snapshot;
@@ -2412,8 +2436,18 @@
 	replication_slot_xmin = procArray->replication_slot_xmin;
 	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
 
-	if (!TransactionIdIsValid(MyProc->xmin))
-		MyProc->xmin = TransactionXmin = xmin;
+	if (!catalog)
+	{
+		if (!TransactionIdIsValid(MyProc->xmin))
+			MyProc->xmin = xmin;
+	}
+	else
+	{
+		if (!TransactionIdIsValid(MyProc->catalogXmin))
+			MyProc->catalogXmin = xmin;
+	}
+	if (!TransactionIdIsValid(TransactionXmin))
+		TransactionXmin = xmin;
 
 	LWLockRelease(ProcArrayLock);
 
@@ -2506,6 +2540,7 @@
 	snapshot->subxcnt = subcount;
 	snapshot->suboverflowed = suboverflowed;
 	snapshot->snapXactCompletionCount = curXactCompletionCount;
+	snapshot->catalog = catalog;
 
 	snapshot->curcid = GetCurrentCommandId(false);
 
@@ -2522,6 +2557,19 @@
 	return snapshot;
 }
 
+Snapshot
+GetSnapshotData(Snapshot snapshot)
+{
+	return GetSnapshotDataImpl(snapshot, false);
+}
+
+
+Snapshot
+GetCatalogSnapshotData(Snapshot snapshot)
+{
+	return GetSnapshotDataImpl(snapshot, true);
+}
+
 /*
  * ProcArrayInstallImportedXmin -- install imported xmin into MyProc->xmin
  *
@@ -2592,7 +2640,7 @@
 		 * GetSnapshotData first, we'll be overwriting a valid xmin here, so
 		 * we don't check that.)
 		 */
-		MyProc->xmin = TransactionXmin = xmin;
+		MyProc->xmin = MyProc->catalogXmin = TransactionXmin = xmin;
 
 		result = true;
 		break;
@@ -2645,7 +2693,7 @@
 		 * Install xmin and propagate the statusFlags that affect how the
 		 * value is interpreted by vacuum.
 		 */
-		MyProc->xmin = TransactionXmin = xmin;
+		MyProc->xmin = MyProc->catalogXmin = TransactionXmin = xmin;
 		MyProc->statusFlags = (MyProc->statusFlags & ~PROC_XMIN_FLAGS) |
 			(proc->statusFlags & PROC_XMIN_FLAGS);
 		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
@@ -3162,7 +3210,8 @@
  */
 void
 ProcNumberGetTransactionIds(ProcNumber procNumber, TransactionId *xid,
-							TransactionId *xmin, int *nsubxid, bool *overflowed)
+							TransactionId *xmin, TransactionId *catalogXmin,
+							int *nsubxid, bool *overflowed)
 {
 	PGPROC	   *proc;
 
@@ -3182,6 +3231,7 @@
 	{
 		*xid = proc->xid;
 		*xmin = proc->xmin;
+		*catalogXmin = proc->catalogXmin;
 		*nsubxid = proc->subxidStatus.count;
 		*overflowed = proc->subxidStatus.overflowed;
 	}
@@ -3356,8 +3406,10 @@
 		{
 			/* Fetch xmin just once - might change on us */
 			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
+			TransactionId pcatalogXmin = UINT32_ACCESS_ONCE(proc->catalogXmin);
+			TransactionId olderpXmin = TransactionIdOlder(pxmin, pcatalogXmin);
 
-			if (excludeXmin0 && !TransactionIdIsValid(pxmin))
+			if (excludeXmin0 && !TransactionIdIsValid(olderpXmin))
 				continue;
 
 			/*
@@ -3365,7 +3417,7 @@
 			 * hasn't set xmin yet will not be rejected by this test.
 			 */
 			if (!TransactionIdIsValid(limitXmin) ||
-				TransactionIdPrecedesOrEquals(pxmin, limitXmin))
+				TransactionIdPrecedesOrEquals(olderpXmin, limitXmin))
 			{
 				VirtualTransactionId vxid;
 
@@ -3456,6 +3508,8 @@
 		{
 			/* Fetch xmin just once - can't change on us, but good coding */
 			TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin);
+			TransactionId catalogpXmin = UINT32_ACCESS_ONCE(proc->catalogXmin);
+			TransactionId oldestpXmin = TransactionIdOlder(pxmin, catalogpXmin);
 
 			/*
 			 * We ignore an invalid pxmin because this means that backend has
@@ -3466,7 +3520,7 @@
 			 * test here.
 			 */
 			if (!TransactionIdIsValid(limitXmin) ||
-				(TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin)))
+				(TransactionIdIsValid(oldestpXmin) && !TransactionIdFollows(oldestpXmin, limitXmin)))
 			{
 				VirtualTransactionId vxid;
 
Index: src/backend/storage/lmgr/proc.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
--- a/src/backend/storage/lmgr/proc.c	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/storage/lmgr/proc.c	(revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -382,7 +382,7 @@
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyProc->xid = InvalidTransactionId;
-	MyProc->xmin = InvalidTransactionId;
+	MyProc->xmin = MyProc->catalogXmin = InvalidTransactionId;
 	MyProc->pid = MyProcPid;
 	MyProc->vxid.procNumber = MyProcNumber;
 	MyProc->vxid.lxid = InvalidLocalTransactionId;
@@ -580,7 +580,7 @@
 	MyProc->fpVXIDLock = false;
 	MyProc->fpLocalTransactionId = InvalidLocalTransactionId;
 	MyProc->xid = InvalidTransactionId;
-	MyProc->xmin = InvalidTransactionId;
+	MyProc->xmin = MyProc->catalogXmin = InvalidTransactionId;
 	MyProc->vxid.procNumber = INVALID_PROC_NUMBER;
 	MyProc->vxid.lxid = InvalidLocalTransactionId;
 	MyProc->databaseId = InvalidOid;
Index: src/backend/utils/time/snapmgr.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
--- a/src/backend/utils/time/snapmgr.c	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/backend/utils/time/snapmgr.c	(revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -290,14 +290,6 @@
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
@@ -332,6 +324,16 @@
 		RegisteredLSN = OldestRegisteredSnapshot->lsn;
 	}
 
+	if (CatalogSnapshot != NULL)
+	{
+		if (OldestRegisteredSnapshot == NULL ||
+					TransactionIdPrecedes(CatalogSnapshot->xmin, OldestRegisteredSnapshot->xmin))
+		{
+			OldestRegisteredSnapshot = CatalogSnapshot;
+			RegisteredLSN = CatalogSnapshot->lsn;
+		}
+	}
+
 	if (OldestActiveSnapshot != NULL)
 	{
 		XLogRecPtr	ActiveLSN = OldestActiveSnapshot->as_snap->lsn;
@@ -388,7 +390,7 @@
 	if (CatalogSnapshot == NULL)
 	{
 		/* Get new snapshot. */
-		CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);
+		CatalogSnapshot = GetCatalogSnapshotData(&CatalogSnapshotData);
 
 		/*
 		 * Make sure the catalog snapshot will be accounted for in decisions
@@ -402,7 +404,7 @@
 		 * NB: it had better be impossible for this to throw error, since the
 		 * CatalogSnapshot pointer is already valid.
 		 */
-		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
+		Assert(TransactionIdIsValid(MyProc->catalogXmin));
 	}
 
 	return CatalogSnapshot;
@@ -423,9 +425,8 @@
 {
 	if (CatalogSnapshot)
 	{
-		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
-		SnapshotResetXmin();
+		MyProc->catalogXmin = InvalidTransactionId;
 	}
 }
 
@@ -444,7 +445,7 @@
 {
 	if (CatalogSnapshot &&
 		ActiveSnapshot == NULL &&
-		pairingheap_is_singular(&RegisteredSnapshots))
+		pairingheap_is_empty(&RegisteredSnapshots))
 		InvalidateCatalogSnapshot();
 }
 
@@ -1081,7 +1082,7 @@
 	if (resetXmin)
 		SnapshotResetXmin();
 
-	Assert(resetXmin || MyProc->xmin == 0);
+	Assert(resetXmin || (MyProc->xmin == InvalidTransactionId && MyProc->catalogXmin == InvalidTransactionId));
 }
 
 
@@ -1626,19 +1627,15 @@
 	if (ActiveSnapshot != NULL)
 		return true;
 
-	/*
-	 * The catalog snapshot is in RegisteredSnapshots when valid, but can be
-	 * removed at any time due to invalidation processing. If explicitly
-	 * registered more than one snapshot has to be in RegisteredSnapshots.
-	 */
-	if (CatalogSnapshot != NULL &&
-		pairingheap_is_singular(&RegisteredSnapshots))
-		return false;
+	return HaveRegisteredSnapshot();
+}
 
+bool
+HaveRegisteredSnapshot(void)
+{
 	return !pairingheap_is_empty(&RegisteredSnapshots);
 }
 
-
 /*
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
@@ -1804,6 +1801,7 @@
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
 	snapshot->snapXactCompletionCount = 0;
+	snapshot->catalog = false;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
Index: src/include/storage/proc.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
--- a/src/include/storage/proc.h	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/include/storage/proc.h	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -56,10 +56,6 @@
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a small number of "weak" relation locks (AccessShareLock,
@@ -179,6 +175,7 @@
 								 * starting our xact, excluding LAZY VACUUM:
 								 * vacuum must not remove tuples deleted by
 								 * xid >= xmin ! */
+	TransactionId catalogXmin;
 
 	int			pid;			/* Backend's process ID; 0 if prepared xact */
 
Index: src/include/storage/procarray.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
--- a/src/include/storage/procarray.h	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/include/storage/procarray.h	(revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -45,6 +45,7 @@
 extern int	GetMaxSnapshotSubxidCount(void);
 
 extern Snapshot GetSnapshotData(Snapshot snapshot);
+extern Snapshot GetCatalogSnapshotData(Snapshot snapshot);
 
 extern bool ProcArrayInstallImportedXmin(TransactionId xmin,
 										 VirtualTransactionId *sourcevxid);
@@ -66,8 +67,8 @@
 
 extern PGPROC *ProcNumberGetProc(int procNumber);
 extern void ProcNumberGetTransactionIds(int procNumber, TransactionId *xid,
-										TransactionId *xmin, int *nsubxid,
-										bool *overflowed);
+										TransactionId *xmin, TransactionId *catalogXmin,
+										int *nsubxid, bool *overflowed);
 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
 extern int	BackendXidGetPid(TransactionId xid);
Index: src/include/utils/snapshot.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
--- a/src/include/utils/snapshot.h	(revision bbc09a323cc3d6c54f2d26c7c6342d36d7edeb31)
+++ b/src/include/utils/snapshot.h	(revision 03c4ff69cbbfa3182e697672d7ea704db293213f)
@@ -183,6 +183,7 @@
 
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
 	bool		copied;			/* false if it's a static snapshot */
+	bool		catalog;		/* snapshot used to access catalog */
 
 	CommandId	curcid;			/* in my xact, CID < curcid are visible */
 
Index: contrib/amcheck/verify_nbtree.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
--- a/contrib/amcheck/verify_nbtree.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/contrib/amcheck/verify_nbtree.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -691,7 +691,8 @@
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
Index: src/backend/access/brin/brin.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
--- a/src/backend/access/brin/brin.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/brin/brin.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -2369,16 +2369,7 @@
 	leaderparticipates = false;
 #endif
 
-	/*
-	 * Enter parallel mode, and create context for parallel build of brin
-	 * index
-	 */
-	EnterParallelMode();
-	Assert(request > 0);
-	pcxt = CreateParallelContext("postgres", "_brin_parallel_build_main",
-								 request);
-
-	scantuplesortstates = leaderparticipates ? request + 1 : request;
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
@@ -2390,7 +2381,21 @@
 	if (!isconcurrent)
 		snapshot = SnapshotAny;
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of brin
+	 * index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_brin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2429,6 +2434,8 @@
 
 	/* Everyone's had a chance to ask for space, so now create the DSM */
 	InitializeParallelDSM(pcxt);
+	if (IsMVCCSnapshot(snapshot))
+		PopActiveSnapshot();
 
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
@@ -2458,7 +2465,7 @@
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2504,7 +2511,7 @@
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
+	brinleader->snapshot = isconcurrent ? InvalidSnapshot : snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2518,6 +2525,12 @@
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	if (isconcurrent)
+	{
+		WaitForParallelWorkersToAttach(pcxt, true);
+		UnregisterSnapshot(snapshot);
+	}
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2526,7 +2539,8 @@
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 }
 
 /*
@@ -2536,6 +2550,7 @@
 _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 {
 	int			i;
+	Snapshot	snapshot = brinleader->snapshot;
 
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(brinleader->pcxt);
@@ -2548,8 +2563,10 @@
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
+	Assert(!brinleader->brinshared->isconcurrent || snapshot == InvalidSnapshot);
+	Assert(brinleader->brinshared->isconcurrent || snapshot != InvalidSnapshot);
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
+		UnregisterSnapshot(snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2800,6 +2817,7 @@
 	TableScanDesc scan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
+	Snapshot	snapshot;
 
 	/* Initialize local tuplesort coordination state */
 	coordinate = palloc0(sizeof(SortCoordinateData));
@@ -2811,8 +2829,21 @@
 	state->bs_sortstate = tuplesort_begin_index_brin(sortmem, coordinate,
 													 TUPLESORT_NONE);
 
+	Assert(!brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
+	/* Hold a registered, active snapshot while building the index info */
+	if (brinshared->isconcurrent)
+	{
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		UnregisterSnapshot(snapshot);
+	}
+	Assert(!brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	indexInfo->ii_Concurrent = brinshared->isconcurrent;
 
 	scan = table_beginscan_parallel(heap,
@@ -2866,8 +2897,7 @@
 	 * The only possible status flag that can be set to the parallel worker is
 	 * PROC_IN_SAFE_IC.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
@@ -2913,8 +2943,12 @@
 	 */
 	sortmem = maintenance_work_mem / brinshared->scantuplesortstates;
 
+	if (brinshared->isconcurrent)
+		PopActiveSnapshot();
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+		PushActiveSnapshot(GetLatestSnapshot());
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
Index: src/backend/access/gin/gininsert.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
--- a/src/backend/access/gin/gininsert.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/gin/gininsert.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -17,6 +17,7 @@
 #include "access/gin_private.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
Index: src/backend/access/gist/gistbuild.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
--- a/src/backend/access/gist/gistbuild.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/gist/gistbuild.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -38,6 +38,7 @@
 #include "access/gist_private.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "optimizer/optimizer.h"
Index: src/backend/access/hash/hash.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
--- a/src/backend/access/hash/hash.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/hash/hash.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -23,6 +23,7 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
Index: src/backend/access/heap/heapam.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
--- a/src/backend/access/heap/heapam.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/heap/heapam.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -575,6 +575,24 @@
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	Assert(ActiveSnapshotSet());
+	PopActiveSnapshot();
+	UnregisterSnapshot(sscan->rs_snapshot);
+	sscan->rs_snapshot = InvalidSnapshot;
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	sscan->rs_snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(sscan->rs_snapshot);
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -593,6 +611,11 @@
 		scan->rs_cbuf = InvalidBuffer;
 	}
 
+	if (unlikely(scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) && likely(scan->rs_inited))
+	{
+		heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
+
 	/*
 	 * Be sure to check for interrupts at least once per page.  Checks at
 	 * higher code levels won't be able to stop a seqscan that encounters many
@@ -1242,6 +1265,13 @@
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT);
+		Assert(ActiveSnapshotSet());
+		PopActiveSnapshot();
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
Index: src/backend/access/nbtree/nbtsort.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
--- a/src/backend/access/nbtree/nbtsort.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/nbtree/nbtsort.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -84,6 +84,7 @@
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -377,6 +378,7 @@
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -425,8 +427,9 @@
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -435,7 +438,7 @@
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -443,7 +446,7 @@
 		/* Initialize secondary spool */
 		btspool2->heap = heap;
 		btspool2->index = index;
-		btspool2->isunique = false;
+		btspool2->isunique = btspool2->unique_dead_ignored = false;
 		/* Save as secondary spool */
 		buildstate->spool2 = btspool2;
 
@@ -466,7 +469,7 @@
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -1145,11 +1148,13 @@
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	fail_on_duplicate = (btspool->unique_dead_ignored && btspool->isunique && btspool2 == NULL);
 
 	if (merge)
 	{
@@ -1353,6 +1358,80 @@
 
 		pfree(dstate);
 	}
+	else if (fail_on_duplicate)
+	{
+		bool was_valid = false,
+		 	 prev_checked = false,
+			 was_null;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL &&
+					((wstate->inskey->allequalimage &&
+							_bt_keep_natts_fast_wasnull(wstate->index, prev, itup, &was_null) > keysz) ||
+						(_bt_keep_natts_wasnull(wstate->index, prev, itup, wstate->inskey, &was_null) > keysz)
+					) &&
+					(btspool->nulls_not_distinct && was_null))
+			{
+				bool call_again, ignored, now_valid;
+				ItemPointerData tid;
+				if (!prev_checked)
+				{
+					call_again = false;
+					tid = prev->t_tid;
+					was_valid = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+					prev_checked = true;
+				}
+
+				call_again = false;
+				tid = itup->t_tid;
+				now_valid = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+				if (was_valid && now_valid)
+				{
+					char	   *key_desc;
+					TupleDesc	tupDes = RelationGetDescr(wstate->index);
+					bool		isnull[INDEX_MAX_KEYS];
+					Datum		values[INDEX_MAX_KEYS];
+
+					index_deform_tuple(itup, tupDes, values, isnull);
+
+					key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_UNIQUE_VIOLATION),
+									errmsg("could not create unique index \"%s\"",
+										   RelationGetRelationName(wstate->index)),
+									key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+									errdetail("Duplicate keys exist."),
+									errtableconstraint(wstate->heap,
+													   RelationGetRelationName(wstate->index))));
+				}
+				was_valid |= now_valid;
+			}
+			else
+			{
+				was_valid = false;
+				prev_checked = false;
+			}
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+	}
 	else
 	{
 		/* merging and deduplication are both unnecessary */
@@ -1414,17 +1493,7 @@
 	leaderparticipates = false;
 #endif
 
-	/*
-	 * Enter parallel mode, and create context for parallel build of btree
-	 * index
-	 */
-	EnterParallelMode();
-	Assert(request > 0);
-	pcxt = CreateParallelContext("postgres", "_bt_parallel_build_main",
-								 request);
-
-	scantuplesortstates = leaderparticipates ? request + 1 : request;
-
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1435,7 +1504,20 @@
 	if (!isconcurrent)
 		snapshot = SnapshotAny;
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
+	/*
+	 * Enter parallel mode, and create context for parallel build of btree
+	 * index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_bt_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1450,7 +1532,7 @@
 	 * Unique case requires a second spool, and so we may have to account for
 	 * another shared workspace for that -- PARALLEL_KEY_TUPLESORT_SPOOL2
 	 */
-	if (!btspool->isunique)
+	if (!btspool->isunique || isconcurrent)
 		shm_toc_estimate_keys(&pcxt->estimator, 2);
 	else
 	{
@@ -1485,6 +1567,8 @@
 
 	/* Everyone's had a chance to ask for space, so now create the DSM */
 	InitializeParallelDSM(pcxt);
+	if (IsMVCCSnapshot(snapshot))
+		PopActiveSnapshot();
 
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
@@ -1515,7 +1599,7 @@
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1529,7 +1613,7 @@
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
 
 	/* Unique case requires a second spool, and associated shared state */
-	if (!btspool->isunique)
+	if (!btspool->isunique || isconcurrent)
 		sharedsort2 = NULL;
 	else
 	{
@@ -1575,7 +1659,7 @@
 	btleader->btshared = btshared;
 	btleader->sharedsort = sharedsort;
 	btleader->sharedsort2 = sharedsort2;
-	btleader->snapshot = snapshot;
+	btleader->snapshot = isconcurrent ? InvalidSnapshot : snapshot;
 	btleader->walusage = walusage;
 	btleader->bufferusage = bufferusage;
 
@@ -1589,15 +1673,25 @@
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	if (isconcurrent)
+	{
+		WaitForParallelWorkersToAttach(pcxt, true);
+		UnregisterSnapshot(snapshot);
+	}
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
+	{
+		INJECTION_POINT("_bt_leader_participate_as_worker");
 		_bt_leader_participate_as_worker(buildstate);
+	}
 
 	/*
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 }
 
 /*
@@ -1607,6 +1701,7 @@
 _bt_end_parallel(BTLeader *btleader)
 {
 	int			i;
+	Snapshot snapshot = btleader->snapshot;
 
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(btleader->pcxt);
@@ -1619,8 +1714,10 @@
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
-		UnregisterSnapshot(btleader->snapshot);
+	Assert(!btleader->btshared->isconcurrent || snapshot == InvalidSnapshot);
+	Assert(btleader->btshared->isconcurrent || snapshot != InvalidSnapshot);
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
+		UnregisterSnapshot(snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
 }
@@ -1697,9 +1794,10 @@
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = btleader->btshared->isconcurrent;
 
 	/* Initialize second spool, if required */
-	if (!btleader->btshared->isunique)
+	if (!btleader->btshared->isunique || btleader->btshared->isconcurrent)
 		leaderworker2 = NULL;
 	else
 	{
@@ -1709,7 +1807,7 @@
 		/* Initialize worker's own secondary spool */
 		leaderworker2->heap = leaderworker->heap;
 		leaderworker2->index = leaderworker->index;
-		leaderworker2->isunique = false;
+		leaderworker2->isunique = leaderworker2->unique_dead_ignored = false;
 	}
 
 	/*
@@ -1758,12 +1856,7 @@
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
-	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
@@ -1796,12 +1889,13 @@
 	btspool->heap = heapRel;
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
+	btspool->unique_dead_ignored = btshared->isconcurrent;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1814,7 +1908,7 @@
 		/* Initialize worker's own secondary spool */
 		btspool2->heap = btspool->heap;
 		btspool2->index = btspool->index;
-		btspool2->isunique = false;
+		btspool2->isunique = btspool2->unique_dead_ignored = false;
 		/* Look up shared state private to tuplesort.c */
 		sharedsort2 = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT_SPOOL2, false);
 		tuplesort_attach_shared(sharedsort2, seg);
@@ -1825,8 +1919,12 @@
 
 	/* Perform sorting of spool, and possibly a spool2 */
 	sortmem = maintenance_work_mem / btshared->scantuplesortstates;
+	if (btshared->isconcurrent)
+		PopActiveSnapshot();
 	_bt_parallel_scan_and_sort(btspool, btspool2, btshared, sharedsort,
 							   sharedsort2, sortmem, false);
+	if (btshared->isconcurrent)
+		PushActiveSnapshot(GetLatestSnapshot());
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
@@ -1868,6 +1966,7 @@
 	TableScanDesc scan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
+	Snapshot snapshot;
 
 	/* Initialize local tuplesort coordination state */
 	coordinate = palloc0(sizeof(SortCoordinateData));
@@ -1880,6 +1979,7 @@
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1902,7 +2002,8 @@
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index,
+										false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
@@ -1917,13 +2018,27 @@
 	buildstate.indtuples = 0;
 	buildstate.btleader = NULL;
 
+	Assert(!btshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	/* Join parallel scan */
+	if (btshared->isconcurrent)
+	{
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 	indexInfo = BuildIndexInfo(btspool->index);
+	if (btshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		UnregisterSnapshot(snapshot);
+	}
+	Assert(!btshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
+
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
 	scan = table_beginscan_parallel(btspool->heap,
 									ParallelTableScanFromBTShared(btshared));
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
-									   true, progress, _bt_build_callback,
+									   true, progress,
+									   _bt_build_callback,
 									   (void *) &buildstate, scan);
 
 	/* Execute this worker's part of the sort */
Index: src/backend/access/spgist/spginsert.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
--- a/src/backend/access/spgist/spginsert.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/access/spgist/spginsert.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -20,6 +20,7 @@
 #include "access/spgist_private.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
Index: src/backend/optimizer/plan/planner.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
--- a/src/backend/optimizer/plan/planner.c	(revision 91dd70fc5ddc60cbad5b17c95f17c6a517f36770)
+++ b/src/backend/optimizer/plan/planner.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6791,6 +6792,7 @@
 	BlockNumber heap_blocks;
 	double		reltuples;
 	double		allvisfrac;
+	Snapshot	snapshot = InvalidSnapshot;
 
 	/*
 	 * We don't allow performing parallel operation in standalone backend or
@@ -6842,6 +6844,10 @@
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	if (!ActiveSnapshotSet()) {
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6899,6 +6905,12 @@
 		parallel_workers--;
 
 done:
+	if (snapshot != InvalidSnapshot)
+	{
+		PopActiveSnapshot();
+		UnregisterSnapshot(snapshot);
+	}
+
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
Index: src/backend/access/table/tableam.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
--- a/src/backend/access/table/tableam.c	(revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/backend/access/table/tableam.c	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -29,6 +29,7 @@
 #include "storage/bufmgr.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
+#include "storage/proc.h"
 
 /*
  * Constants to control the behavior of block allocation to parallel workers
@@ -149,15 +150,23 @@
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+
+	if (snapshot == InvalidSnapshot)
+	{
+		pscan->phs_snapshot_any = false;
+		pscan->phs_snapshot_reset = true;
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_snapshot_reset = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_snapshot_reset = false;
 	}
 }
 
@@ -170,7 +179,16 @@
 
 	Assert(RelationGetRelid(relation) == pscan->phs_relid);
 
-	if (!pscan->phs_snapshot_any)
+	if (pscan->phs_snapshot_reset)
+	{
+		Assert(!ActiveSnapshotSet());
+		Assert(MyProc->xmin == InvalidTransactionId);
+
+		snapshot = RegisterSnapshot(GetLatestSnapshot());
+		PushActiveSnapshot(snapshot);
+		flags |= (SO_RESET_SNAPSHOT | SO_TEMP_SNAPSHOT);
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
Index: src/backend/access/transam/parallel.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
--- a/src/backend/access/transam/parallel.c	(revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/backend/access/transam/parallel.c	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_SET_FLAG		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -289,6 +290,9 @@
 							   mul_size(PARALLEL_ERROR_QUEUE_SIZE,
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool), pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
@@ -359,6 +363,7 @@
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -474,6 +479,15 @@
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_set_flag = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_set_flag = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_SET_FLAG, snapshot_set_flag_space);
 	}
 
 	/* Restore previous memory context. */
@@ -511,6 +525,7 @@
 	if (pcxt->nworkers > 0)
 	{
 		char	   *error_queue_space;
+		bool	   *snapshot_set_flag_space;
 		int			i;
 
 		error_queue_space =
@@ -525,6 +540,11 @@
 			shm_mq_set_receiver(mq, MyProc);
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
+
+		snapshot_set_flag_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_SET_FLAG, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_set_flag_space[i] = false;
 	}
 }
 
@@ -669,7 +689,7 @@
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -713,9 +733,12 @@
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_set_flag))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1274,6 +1297,7 @@
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_flag_set_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1449,6 +1473,9 @@
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	snapshot_flag_set_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_SET_FLAG, false);
+	snapshot_flag_set_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
Index: src/include/access/parallel.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
--- a/src/include/access/parallel.h	(revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/include/access/parallel.h	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
@@ -26,6 +26,7 @@
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_set_flag;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
Index: src/include/access/relscan.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
--- a/src/include/access/relscan.h	(revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/include/access/relscan.h	(revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
@@ -64,6 +64,7 @@
 {
 	Oid			phs_relid;		/* OID of relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
+	bool		phs_snapshot_reset;
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
Index: src/include/utils/snapmgr.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
--- a/src/include/utils/snapmgr.h	(revision 103bbb703f974c65be6e238ca2c181f1470ceb25)
+++ b/src/include/utils/snapmgr.h	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
@@ -96,6 +96,7 @@
 extern void WaitForOlderSnapshots(TransactionId limitXmin, bool progress);
 extern bool ThereAreNoPriorRegisteredSnapshots(void);
 extern bool HaveRegisteredOrActiveSnapshot(void);
+extern bool HaveRegisteredSnapshot(void);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
Index: contrib/pgstattuple/pgstattuple.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
--- a/contrib/pgstattuple/pgstattuple.c	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/contrib/pgstattuple/pgstattuple.c	(revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
@@ -286,6 +286,9 @@
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			default:
 				err = "unknown index";
 				break;
@@ -329,7 +332,7 @@
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
Index: src/backend/access/Makefile
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
--- a/src/backend/access/Makefile	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/backend/access/Makefile	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -9,6 +9,6 @@
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  sequence table tablesample transam stir
 
 include $(top_srcdir)/src/backend/common.mk
Index: src/backend/access/meson.build
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
--- a/src/backend/access/meson.build	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/backend/access/meson.build	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -14,3 +14,4 @@
 subdir('table')
 subdir('tablesample')
 subdir('transam')
+subdir('stir')
Index: src/backend/commands/analyze.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
--- a/src/backend/commands/analyze.c	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/backend/commands/analyze.c	(revision 75cd94daf4b0b6147e7f3a386ad1a93fb086653b)
@@ -719,6 +719,7 @@
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
Index: src/backend/commands/vacuumparallel.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
--- a/src/backend/commands/vacuumparallel.c	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/backend/commands/vacuumparallel.c	(revision 75cd94daf4b0b6147e7f3a386ad1a93fb086653b)
@@ -883,6 +883,7 @@
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
Index: src/include/access/genam.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
--- a/src/include/access/genam.h	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/access/genam.h	(revision 75cd94daf4b0b6147e7f3a386ad1a93fb086653b)
@@ -48,6 +48,7 @@
 	bool		analyze_only;	/* ANALYZE (without any actual vacuum) */
 	bool		report_progress;	/* emit progress.h status reports */
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
+	bool		validate_index;		/* not a vacuum but an index validation */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
Index: src/include/access/reloptions.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
--- a/src/include/access/reloptions.h	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/access/reloptions.h	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -51,8 +51,9 @@
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
Index: src/include/catalog/pg_am.dat
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
--- a/src/include/catalog/pg_am.dat	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_am.dat	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -33,5 +33,7 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
-
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 ]
Index: src/include/catalog/pg_amop.dat
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/catalog/pg_amop.dat b/src/include/catalog/pg_amop.dat
--- a/src/include/catalog/pg_amop.dat	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_amop.dat	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -3227,4 +3227,8 @@
   amoprighttype => 'point', amopstrategy => '7', amopopr => '@>(box,point)',
   amopmethod => 'brin' },
 
+{ amopfamily => 'stir/record_ops', amoplefttype => 'record',
+  amoprighttype => 'record', amopstrategy => '1', amopopr => '=(record,record)',
+  amopmethod => 'stir' },
+
 ]
Index: src/include/catalog/pg_opclass.dat
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
--- a/src/include/catalog/pg_opclass.dat	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_opclass.dat	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+{ oid => '5557', oid_symbol => 'RECORD_STIR_OPS_OID',
+  opcmethod => 'stir', opcname => 'record_ops', opcfamily => 'stir/record_ops',
+  opcintype => 'record' },
+
 ]
Index: src/include/catalog/pg_opfamily.dat
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
--- a/src/include/catalog/pg_opfamily.dat	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_opfamily.dat	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -302,6 +302,8 @@
   opfmethod => 'btree', opfname => 'multirange_ops' },
 { oid => '4225',
   opfmethod => 'hash', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'record_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
 
Index: src/include/catalog/pg_proc.dat
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
--- a/src/include/catalog/pg_proc.dat	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/catalog/pg_proc.dat	(revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'stir index access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
@@ -5487,9 +5491,9 @@
   proname => 'pg_stat_get_activity', prorows => '100', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => 'int4',
-  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,text,numeric,text,bool,text,bool,bool,int4,int8}',
-  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,gss_delegation,leader_pid,query_id}',
+  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,text,numeric,text,bool,text,bool,bool,int4,int8,xid}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,gss_delegation,leader_pid,query_id,backend_catalog_xmin}',
   prosrc => 'pg_stat_get_activity' },
 { oid => '6318', descr => 'describe wait events',
   proname => 'pg_get_wait_events', procost => '10', prorows => '250',
Index: src/include/utils/index_selfuncs.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
--- a/src/include/utils/index_selfuncs.h	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/include/utils/index_selfuncs.h	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -70,5 +70,13 @@
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 
 #endif							/* INDEX_SELFUNCS_H */
Index: src/test/regress/expected/amutils.out
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
--- a/src/test/regress/expected/amutils.out	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/test/regress/expected/amutils.out	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -173,7 +173,13 @@
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
Index: src/test/regress/expected/opr_sanity.out
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
--- a/src/test/regress/expected/opr_sanity.out	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/test/regress/expected/opr_sanity.out	(revision 75cd94daf4b0b6147e7f3a386ad1a93fb086653b)
@@ -2092,7 +2092,8 @@
        4000 |           28 | ^@
        4000 |           29 | <^
        4000 |           30 | >^
-(124 rows)
+       5555 |            1 | =
+(125 rows)
 
 -- Check that all opclass search operators have selectivity estimators.
 -- This is not absolutely required, but it seems a reasonable thing
Index: src/test/regress/expected/psql.out
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
--- a/src/test/regress/expected/psql.out	(revision 9817a8ff254bae0291a320bd306d2ec1280f7592)
+++ b/src/test/regress/expected/psql.out	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -5027,7 +5027,8 @@
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5041,7 +5042,8 @@
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5077,7 +5079,8 @@
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement
+(9 rows)
 
 \dA+ *
                              List of access methods
@@ -5091,7 +5094,8 @@
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement
+(9 rows)
 
 \dA+ h*
                      List of access methods
Index: src/backend/catalog/system_views.sql
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
--- a/src/backend/catalog/system_views.sql	(revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/backend/catalog/system_views.sql	(revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -879,6 +879,7 @@
             S.state,
             S.backend_xid,
             s.backend_xmin,
+            s.backend_catalog_xmin,
             S.query_id,
             S.query,
             S.backend_type
Index: src/backend/utils/activity/backend_status.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
--- a/src/backend/utils/activity/backend_status.c	(revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/backend/utils/activity/backend_status.c	(revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -838,6 +838,7 @@
 			ProcNumberGetTransactionIds(procNumber,
 										&localentry->backend_xid,
 										&localentry->backend_xmin,
+										&localentry->backend_catalog_xmin,
 										&localentry->backend_subxact_count,
 										&localentry->backend_subxact_overflowed);
 
Index: src/backend/utils/adt/pgstatfuncs.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
--- a/src/backend/utils/adt/pgstatfuncs.c	(revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/backend/utils/adt/pgstatfuncs.c	(revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -302,7 +302,7 @@
 Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_ACTIVITY_COLS	31
+#define PG_STAT_GET_ACTIVITY_COLS	32
 	int			num_backends = pgstat_fetch_stat_numbackends();
 	int			curr_backend;
 	int			pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -353,6 +353,11 @@
 		else
 			nulls[15] = true;
 
+		if (TransactionIdIsValid(local_beentry->backend_catalog_xmin))
+			values[31] = TransactionIdGetDatum(local_beentry->backend_catalog_xmin);
+		else
+			nulls[31] = true;
+
 		if (TransactionIdIsValid(local_beentry->backend_xmin))
 			values[16] = TransactionIdGetDatum(local_beentry->backend_xmin);
 		else
Index: src/include/utils/backend_status.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
--- a/src/include/utils/backend_status.h	(revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/include/utils/backend_status.h	(revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -266,6 +266,8 @@
 	 */
 	TransactionId backend_xmin;
 
+	TransactionId backend_catalog_xmin;
+
 	/*
 	 * Number of cached subtransactions in the current session.
 	 */
Index: src/test/regress/expected/rules.out
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
--- a/src/test/regress/expected/rules.out	(revision b24132f98f93d14c64dfe41973337e13d5e7636b)
+++ b/src/test/regress/expected/rules.out	(revision 6c55d9749e2999542d4e6281db733fdd47930796)
@@ -1759,10 +1759,11 @@
     s.state,
     s.backend_xid,
     s.backend_xmin,
+    s.backend_catalog_xmin,
     s.query_id,
     s.query,
     s.backend_type
-   FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
+   FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id, backend_catalog_xmin)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_all_indexes| SELECT c.oid AS relid,
@@ -1882,7 +1883,7 @@
     gss_princ AS principal,
     gss_enc AS encrypted,
     gss_delegation AS credentials_delegated
-   FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
+   FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id, backend_catalog_xmin)
   WHERE (client_port IS NOT NULL);
 pg_stat_io| SELECT backend_type,
     object,
@@ -2086,7 +2087,7 @@
     w.sync_priority,
     w.sync_state,
     w.reply_time
-   FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
+   FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id, backend_catalog_xmin)
      JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_replication_slots| SELECT s.slot_name,
@@ -2120,7 +2121,7 @@
     ssl_client_dn AS client_dn,
     ssl_client_serial AS client_serial,
     ssl_issuer_dn AS issuer_dn
-   FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
+   FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id, backend_catalog_xmin)
   WHERE (client_port IS NOT NULL);
 pg_stat_subscription| SELECT su.oid AS subid,
     su.subname,
Index: src/backend/access/stir/Makefile
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
--- /dev/null	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
+++ b/src/backend/access/stir/Makefile	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
Index: src/backend/access/stir/meson.build
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
--- /dev/null	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
+++ b/src/backend/access/stir/meson.build	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -0,0 +1,5 @@
+# Copyright (c) 2024-2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'stir.c',
+)
Index: src/backend/access/stir/stir.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
--- /dev/null	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
+++ b/src/backend/access/stir/stir.c	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -0,0 +1,517 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * Portions Copyright (c) 2024-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	/* Initialize contents of meta page */
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+	GenericXLogFinish(state);
+
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	GenericXLogState *state;
+	uint16 blkNo;
+
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			state = GenericXLogStart(index);
+			page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+			Assert(!PageIsNew(page));
+
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				GenericXLogFinish(state);
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			/* Didn't fit, must try other pages */
+			GenericXLogAbort(state);
+			UnlockReleaseBuffer(buffer);
+		}
+
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		state = GenericXLogStart(index);
+		metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else added a new page to the index; let's try again */
+			GenericXLogAbort(state);
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+
+			page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+			GenericXLogFinish(state);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because TODO
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		GenericXLogFinish(state);
+	}
+	else
+	{
+		GenericXLogAbort(state);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
Index: src/include/access/stir.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
--- /dev/null	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
+++ b/src/include/access/stir.h	(revision d8df9daea76374468c28f8e9d60d83539aad05c8)
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2024-2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
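Editorial note, not part of the patch: the layout macros in stir.h amount to a little arithmetic, which the standalone sketch below models. All MODEL_* names and the numeric sizes (8192-byte block, 24-byte page header, 6-byte StirTuple, 8-byte opaque area, 8-byte MAXALIGN) are assumptions for a default build; the real values come from the PostgreSQL headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a default build; not taken from the patch itself. */
#define MODEL_BLCKSZ            8192
#define MODEL_MAXALIGN(x)       (((size_t) (x) + 7) & ~(size_t) 7)
#define MODEL_PAGE_HEADER_SIZE  24	/* assumed SizeOfPageHeaderData */
#define MODEL_STIR_TUPLE_SIZE   6	/* assumed sizeof(StirTuple) */
#define MODEL_OPAQUE_SIZE       8	/* assumed sizeof(StirPageOpaqueData) */

/* Byte offset of 1-based tuple number `offset` within the page contents,
 * mirroring the arithmetic of StirPageGetTuple. */
static size_t
model_tuple_offset(unsigned offset)
{
	return (size_t) MODEL_STIR_TUPLE_SIZE * (offset - 1);
}

/* Free bytes on a data page that already holds `maxoff` tuples,
 * mirroring the arithmetic of StirPageGetFreeSpace. */
static size_t
model_free_space(unsigned maxoff)
{
	return MODEL_BLCKSZ
		- MODEL_MAXALIGN(MODEL_PAGE_HEADER_SIZE)
		- (size_t) maxoff * MODEL_STIR_TUPLE_SIZE
		- MODEL_MAXALIGN(MODEL_OPAQUE_SIZE);
}

/* Number of fixed-size tuples that fit on an empty data page. */
static unsigned
model_page_capacity(void)
{
	return (unsigned) (model_free_space(0) / MODEL_STIR_TUPLE_SIZE);
}
```

Under these assumptions a data page holds a little over a thousand TIDs, which is why a sequential walk from STIR_HEAD_BLKNO in stirbulkdelete stays cheap.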
Index: src/backend/utils/sort/tuplesortvariants.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
--- a/src/backend/utils/sort/tuplesortvariants.c	(revision 35f233300cd190b0a17e66f2b4bffa2481e62af9)
+++ b/src/backend/utils/sort/tuplesortvariants.c	(revision bc1fe05f38fbdda049075b9b1dc238bf0d9c240e)
@@ -123,6 +123,7 @@
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored;	/* ignore duplicates if some of them are dead */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +350,7 @@
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +393,7 @@
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -514,6 +517,7 @@
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = false;
 	arg->uniqueNullsNotDistinct = false;
+	arg->uniqueDeadIgnored = false;
 
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
@@ -1520,6 +1524,7 @@
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1534,56 @@
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool    			any_tuple_dead,
+								call_again = false,
+								ignored;
+
+			TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+																   &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
 
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
 
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
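Editorial note, not part of the patch: the fail-fast branch above reduces to a small decision rule, modelled here in standalone C. The function name and its boolean inputs are hypothetical; in the patch the visibility answers come from table_index_fetch_tuple with SnapshotSelf.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether two equal index keys should raise a unique violation.
 * With unique_dead_ignored set, the pair is only reported when both heap
 * tuples are still visible; a dead tuple on either side suppresses the error.
 */
static bool
model_report_duplicate(bool unique_dead_ignored,
					   bool tuple1_visible, bool tuple2_visible)
{
	if (unique_dead_ignored && (!tuple1_visible || !tuple2_visible))
		return false;		/* skip: at least one tuple is dead */
	return true;			/* genuine duplicate, or checking disabled */
}
```

With the flag off, the behavior matches the pre-patch code: any equal pair is reported.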
Index: src/bin/pg_amcheck/t/007_concurrently_unique.pl
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/bin/pg_amcheck/t/007_concurrently_unique.pl b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
new file mode 100644
--- /dev/null	(revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
+++ b/src/bin/pg_amcheck/t/007_concurrently_unique.pl	(revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
@@ -0,0 +1,235 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use IPC::SysV;
+use threads;
+use Test::More;
+use Test::Builder;
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 128MB');
+$node->append_conf('postgresql.conf', 'shared_buffers = 256MB');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i, updated_at)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	# $node->psql('postgres', q(INSERT INTO tbl SELECT i,0,0,0,now() FROM generate_series(1, 1000) s(i);));
+	# while [ $? -eq 0 ]; do make -C src/bin/pg_amcheck/ check PROVE_TESTS='t/007_*' ; done
+
+	$node->pgbench(
+		'--no-vacuum --client=40 --exit-on-abort --transactions=10000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			# Ensure some HOT updates happen
+			'001_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+			),
+			'002_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*100,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+			'003_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+			'004_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	my $pg_bench_fork_flag;
+	while (1) {
+		shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);
+		last if $pg_bench_fork_flag eq "stop";
+	}
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr, $n, $stderr_saved);
+
+#		ok(send_query_and_wait(\%psql, q[SELECT pg_sleep(10);], qr/^.*$/m), 'SELECT');
+
+		while (1)
+		{
+
+			if (int(rand(2)) == 0) {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+			} else {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+			}
+			is($result, '0', 'ALTER TABLE is correct');
+
+
+			if (1)
+			{
+				my $sql = q(select pg_sleep(0); CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+
+				($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+				is($result, '0', 'CREATE INDEX is correct');
+				$stderr_saved = $stderr;
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_check for new index is correct');
+				if ($result)
+				{
+					diag($stderr);
+					diag($stderr_saved);
+					BAIL_OUT($stderr);
+				} else {
+					diag('create:)' . $n++);
+				}
+
+				if (1)
+				{
+					($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+					is($result, '0', 'REINDEX 2 is correct');
+					if ($result) {
+						diag($stderr);
+						BAIL_OUT($stderr);
+					}
+
+					($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+					is($result, '0', 'bt_index_check 2 is correct');
+					if ($result)
+					{
+						diag($stderr);
+						BAIL_OUT($stderr);
+					} else {
+						diag('reindex2:)' . $n++);
+					}
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+				is($result, '0', 'DROP INDEX is correct');
+			}
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+		$psql{stdin} .= "\\q\n";
+		$psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+
+	shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# For each query we run, we'll restart the timeout.  Otherwise the timeout
+	# would apply to the whole test script, and would need to be set very high
+	# to survive when running under Valgrind.
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			diag("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
Index: src/include/access/transam.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
--- a/src/include/access/transam.h	(revision 35f233300cd190b0a17e66f2b4bffa2481e62af9)
+++ b/src/include/access/transam.h	(revision 3a0fa65e328d51b6c97b44a72778b6ee21fe4478)
@@ -344,6 +344,21 @@
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the older of the two IDs, assuming they're both normal */
 static inline TransactionId
 NormalTransactionIdOlder(TransactionId a, TransactionId b)
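Editorial note, not part of the patch: a standalone model of how TransactionIdNewer is expected to behave, including across XID wraparound. The Model* names are hypothetical; the comparison follows the usual modulo-2^32 rule used for normal XIDs, with 0 standing in for the invalid XID and 3 for the first normal XID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ModelXid;

#define MODEL_INVALID_XID       ((ModelXid) 0)
#define MODEL_FIRST_NORMAL_XID  ((ModelXid) 3)

static bool
model_xid_is_normal(ModelXid x)
{
	return x >= MODEL_FIRST_NORMAL_XID;
}

/* True if a is logically later than b, comparing modulo 2^32. */
static bool
model_xid_follows(ModelXid a, ModelXid b)
{
	if (!model_xid_is_normal(a) || !model_xid_is_normal(b))
		return a > b;				/* special XIDs compare as plain integers */
	return (int32_t) (a - b) > 0;	/* signed distance handles wraparound */
}

/* Return the newer of two XIDs; an invalid XID yields the other argument. */
static ModelXid
model_xid_newer(ModelXid a, ModelXid b)
{
	if (a == MODEL_INVALID_XID)
		return b;
	if (b == MODEL_INVALID_XID)
		return a;
	return model_xid_follows(a, b) ? a : b;
}
```

The wraparound case is the interesting one: a small XID issued just after the counter wrapped is newer than a very large one issued just before.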
Index: src/include/utils/tuplesort.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
--- a/src/include/utils/tuplesort.h	(revision 35f233300cd190b0a17e66f2b4bffa2481e62af9)
+++ b/src/include/utils/tuplesort.h	(revision 3a0fa65e328d51b6c97b44a72778b6ee21fe4478)
@@ -428,6 +428,7 @@
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
Index: src/backend/access/nbtree/nbtutils.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
--- a/src/backend/access/nbtree/nbtutils.c	(revision 3a0fa65e328d51b6c97b44a72778b6ee21fe4478)
+++ b/src/backend/access/nbtree/nbtutils.c	(revision bc1fe05f38fbdda049075b9b1dc238bf0d9c240e)
@@ -100,8 +100,6 @@
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4775,6 +4773,14 @@
 	return tidpivot;
 }
 
+int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+				BTScanInsert itup_key) {
+	bool ignored = false;
+	return _bt_keep_natts_wasnull(rel, lastleft, firstright, itup_key, &ignored);
+}
+
+
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
@@ -4786,9 +4792,10 @@
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
-_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+int
+_bt_keep_natts_wasnull(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key,
+			   bool *wasnull)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4814,6 +4821,7 @@
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		(*wasnull) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4838,6 +4846,13 @@
 	return keepnatts;
 }
 
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	bool ignored = false;
+	return _bt_keep_natts_fast_wasnull(rel, lastleft, firstright, &ignored);
+}
+
 /*
  * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
  *
@@ -4861,7 +4876,8 @@
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast_wasnull(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+							bool *wasnull)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4878,6 +4894,7 @@
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		*wasnull |= (isNull1 | isNull2);
 		att = TupleDescAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
Index: src/include/access/nbtree.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
--- a/src/include/access/nbtree.h	(revision 3a0fa65e328d51b6c97b44a72778b6ee21fe4478)
+++ b/src/include/access/nbtree.h	(revision bc1fe05f38fbdda049075b9b1dc238bf0d9c240e)
@@ -1302,8 +1302,15 @@
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
+							 IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts_wasnull(Relation rel, IndexTuple lastleft,
+							 IndexTuple firstright, BTScanInsert itup_key,
+							 bool *wasnull);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
 								IndexTuple firstright);
+extern int	_bt_keep_natts_fast_wasnull(Relation rel, IndexTuple lastleft,
+								  IndexTuple firstright, bool *wasnull);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
Index: src/backend/optimizer/util/plancat.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
--- a/src/backend/optimizer/util/plancat.c	(revision bc1fe05f38fbdda049075b9b1dc238bf0d9c240e)
+++ b/src/backend/optimizer/util/plancat.c	(revision 94aa5d7dab7e8ebd77004b50ba96b1f82a04c249)
@@ -720,6 +720,7 @@
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -813,7 +814,13 @@
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -835,10 +842,9 @@
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
 
 			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
+			foundValid |= idxForm->indisvalid;
 			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			break;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
@@ -932,6 +938,7 @@
 			goto next;
 
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -939,7 +946,8 @@
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required at planning time. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
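Editorial note, not part of the patch: a standalone model of the arbiter selection rule that the plancat.c change appears to implement. ModelIndex, its `matches` field (standing in for the inference-clause match) and model_choose_arbiters are hypothetical names; only the indisready / indisvalid rule is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ModelIndex
{
	bool		indisready;	/* index accepts inserts */
	bool		indisvalid;	/* index is usable by queries */
	bool		matches;	/* assumed: index matches the ON CONFLICT spec */
} ModelIndex;

/*
 * Collect positions of arbiter candidates into `chosen` (caller provides
 * room for n entries). Returns the number chosen, or -1 where the real
 * code would raise "there is no unique or exclusion constraint matching
 * the ON CONFLICT specification".
 */
static int
model_choose_arbiters(const ModelIndex *idx, int n, int *chosen)
{
	int			count = 0;
	bool		found_valid = false;

	for (int i = 0; i < n; i++)
	{
		if (!idx[i].indisready || !idx[i].matches)
			continue;		/* not ready yet, or does not match */
		chosen[count++] = i;
		found_valid |= idx[i].indisvalid;
	}

	/* At least one chosen index must already be valid. */
	if (count == 0 || !found_valid)
		return -1;
	return count;
}
```

The point of the rule is that a ready-but-not-yet-valid index is included alongside a valid one, so that concurrent transactions agree on the arbiter set, while a statement that would rely only on not-yet-valid indexes is still rejected.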
Index: src/backend/access/index/genam.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
--- a/src/backend/access/index/genam.c	(revision 94aa5d7dab7e8ebd77004b50ba96b1f82a04c249)
+++ b/src/backend/access/index/genam.c	(revision ea1fcacc7cead3e2fccf581d20e51244a7107435)
@@ -454,7 +454,7 @@
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
Index: src/test/modules/injection_points/Makefile
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
--- a/src/test/modules/injection_points/Makefile	(revision 56c9d3f4842baa53d7ab13d0764eae7f305aba0f)
+++ b/src/test/modules/injection_points/Makefile	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -13,7 +13,8 @@
 REGRESS = injection_points
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = inplace
+ISOLATION = inplace \
+			reset_snapshots
 
 TAP_TESTS = 1
 
Index: src/test/modules/injection_points/expected/reset_snapshots.out
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/modules/injection_points/expected/reset_snapshots.out b/src/test/modules/injection_points/expected/reset_snapshots.out
new file mode 100644
--- /dev/null	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
+++ b/src/test/modules/injection_points/expected/reset_snapshots.out	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -0,0 +1,318 @@
+unused step name: sleep
+Parsed test spec with 2 sessions
+
+starting permutation: set_parallel_workers_1 create_index_concurrently_simple reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_simple: CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j);
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach: 
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+
+starting permutation: set_parallel_workers_1 create_unique_index_concurrently_simple reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_unique_index_concurrently_simple: CREATE UNIQUE INDEX CONCURRENTLY idx ON test.tbl(i);
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach: 
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+
+starting permutation: set_parallel_workers_1 create_index_concurrently_predicate_expression_mod reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_predicate_expression_mod: CREATE INDEX CONCURRENTLY idx ON test.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach: 
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+
+starting permutation: set_parallel_workers_1 create_index_concurrently_predicate_set_xid_no_param reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+step create_index_concurrently_predicate_set_xid_no_param: CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j) WHERE test.predicate_stable_no_param();
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach: 
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+
+starting permutation: set_parallel_workers_1 create_index_concurrently_predicate_set_xid reindex_index_concurrently drop_index detach
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step set_parallel_workers_1: ALTER TABLE test.tbl SET (parallel_workers=0);
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_predicate_set_xid: CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j) WHERE test.predicate_stable(i);
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx;
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach: 
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+
+starting permutation: set_parallel_workers_2 create_index_concurrently_simple wakeup reindex_index_concurrently wakeup drop_index detach
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step set_parallel_workers_2: ALTER TABLE test.tbl SET (parallel_workers=2);
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+step create_index_concurrently_simple: CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j); <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_simple: <... completed>
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx; <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: <... completed>
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach: 
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+
+starting permutation: set_parallel_workers_2 create_unique_index_concurrently_simple wakeup reindex_index_concurrently wakeup drop_index detach
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step set_parallel_workers_2: ALTER TABLE test.tbl SET (parallel_workers=2);
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+step create_unique_index_concurrently_simple: CREATE UNIQUE INDEX CONCURRENTLY idx ON test.tbl(i); <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_unique_index_concurrently_simple: <... completed>
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx; <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: <... completed>
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach: 
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+
+starting permutation: set_parallel_workers_2 create_index_concurrently_predicate_expression_mod wakeup reindex_index_concurrently wakeup drop_index detach
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step set_parallel_workers_2: ALTER TABLE test.tbl SET (parallel_workers=2);
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+step create_index_concurrently_predicate_expression_mod: CREATE INDEX CONCURRENTLY idx ON test.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0; <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step create_index_concurrently_predicate_expression_mod: <... completed>
+test: NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+step reindex_index_concurrently: REINDEX INDEX CONCURRENTLY test.idx; <waiting ...>
+step wakeup: SELECT injection_points_wakeup('_bt_leader_participate_as_worker');
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+test: NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
+step reindex_index_concurrently: <... completed>
+step drop_index: DROP INDEX CONCURRENTLY test.idx;
+step detach: 
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
Index: src/test/modules/injection_points/meson.build
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
--- a/src/test/modules/injection_points/meson.build	(revision 56c9d3f4842baa53d7ab13d0764eae7f305aba0f)
+++ b/src/test/modules/injection_points/meson.build	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -42,6 +42,7 @@
   'isolation': {
     'specs': [
       'inplace',
+      'reset_snapshots',
     ],
   },
   'tap': {
Index: src/test/modules/injection_points/specs/reset_snapshots.spec
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/test/modules/injection_points/specs/reset_snapshots.spec b/src/test/modules/injection_points/specs/reset_snapshots.spec
new file mode 100644
--- /dev/null	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
+++ b/src/test/modules/injection_points/specs/reset_snapshots.spec	(revision 3dea72b62adc8806917dc459b82ff44d962bcb12)
@@ -0,0 +1,114 @@
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, j int);
+	INSERT INTO test.tbl SELECT i, i * i FROM generate_series(1, 200) s(i);
+
+	CREATE FUNCTION test.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+									  BEGIN
+										EXECUTE 'SELECT txid_current()';
+										RETURN MOD($1, 2) = 0;
+									  END; $$;
+
+	CREATE FUNCTION test.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+									  BEGIN
+										EXECUTE 'SELECT txid_current()';
+										RETURN false;
+									  END; $$;
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session test
+setup	{
+	SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+	SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+	SELECT injection_points_attach('_bt_leader_participate_as_worker', 'wait');
+}
+step sleep { SELECT pg_sleep(10); }
+step drop_index { DROP INDEX CONCURRENTLY test.idx; }
+step create_index_concurrently_simple	{ CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j); }
+step create_unique_index_concurrently_simple	{ CREATE UNIQUE INDEX CONCURRENTLY idx ON test.tbl(i); }
+step create_index_concurrently_predicate_expression_mod	{ CREATE INDEX CONCURRENTLY idx ON test.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0; }
+step create_index_concurrently_predicate_set_xid	{ CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j) WHERE test.predicate_stable(i); }
+step create_index_concurrently_predicate_set_xid_no_param	{ CREATE INDEX CONCURRENTLY idx ON test.tbl(i, j) WHERE test.predicate_stable_no_param(); }
+step reindex_index_concurrently { REINDEX INDEX CONCURRENTLY test.idx; }
+step set_parallel_workers_1 { ALTER TABLE test.tbl SET (parallel_workers=0); }
+step set_parallel_workers_2 { ALTER TABLE test.tbl SET (parallel_workers=2); }
+step detach {
+	SELECT injection_points_detach('heapam_index_validate_scan_no_xid');
+	SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+	SELECT injection_points_detach('_bt_leader_participate_as_worker');
+}
+
+session wakeup_session
+step wakeup { SELECT injection_points_wakeup('_bt_leader_participate_as_worker'); }
+
+permutation
+	set_parallel_workers_1
+	create_index_concurrently_simple
+	reindex_index_concurrently
+	drop_index
+	detach
+
+permutation
+	set_parallel_workers_1
+	create_unique_index_concurrently_simple
+	reindex_index_concurrently
+	drop_index
+	detach
+
+permutation
+	set_parallel_workers_1
+	create_index_concurrently_predicate_expression_mod
+	reindex_index_concurrently
+	drop_index
+	detach
+
+permutation
+	set_parallel_workers_1
+	create_index_concurrently_predicate_set_xid_no_param
+	reindex_index_concurrently
+	drop_index
+	detach
+
+permutation
+	set_parallel_workers_1
+	create_index_concurrently_predicate_set_xid
+	reindex_index_concurrently
+	drop_index
+	detach
+
+permutation
+	set_parallel_workers_2
+	create_index_concurrently_simple
+	wakeup
+	reindex_index_concurrently
+	wakeup
+	drop_index
+	detach
+
+permutation
+	set_parallel_workers_2
+	create_unique_index_concurrently_simple
+	wakeup
+	reindex_index_concurrently
+	wakeup
+	drop_index
+	detach
+
+permutation
+	set_parallel_workers_2
+	create_index_concurrently_predicate_expression_mod
+	wakeup
+	reindex_index_concurrently
+	wakeup
+	drop_index
+	detach
\ No newline at end of file


^ permalink  raw  reply  [nested|flat] 33+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-11-12 15:00  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-11-12 15:00 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, everyone!

With winter approaching, it’s the perfect time to dive back into work on
this patch! :)

The first attached patch implements Matthias's idea of periodically
resetting the snapshot during the initial heap scan. The next step will be
to add support for parallel builds.

Additionally, here are a few comments on previous emails:

> In heapam_index_build_range_scan, it seems like you're popping the
> snapshot and registering a new one while holding a tuple from
> heap_getnext(), thus while holding a page lock. I'm not so sure that's
> OK, especially when catalogs are also involved (specifically for
> expression indexes, where functions could potentially be updated or
> dropped if we re-create the visibility snapshot)

Now, visibility snapshots are updated between pages.

As for the catalog snapshot:
* Dropping functions isn’t possible due to dependencies and locking
constraints.

* Updating functions is possible, but it offers the same level of isolation
as we have now:
1) Functions are already converted into an execution state and aren’t
re-read from the catalog during the scan.
2) During the validation phase, the latest version of a function will be
used.
3) Even in the initial phase, predicates and expressions could be read
using different catalog snapshots, as it’s possible to receive a cache
invalidation message before the first FormIndexDatum is created.
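
To illustrate the "reset between pages" behaviour described above, here is a
minimal standalone sketch in C. It is not code from the patch: ToyIndexInfo,
ToyScan and the toy_* functions are hypothetical names, and a plain counter
stands in for a real snapshot. Only the 64-page interval and the eligibility
condition (concurrent, non-unique, not a system catalog) are taken from the
attached diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Interval taken from the diff (SO_RESET_SNAPSHOT_EACH_N_PAGE). */
#define TOY_RESET_SNAPSHOT_EACH_N_PAGE 64

/* Hypothetical stand-in for the index properties the patch consults. */
typedef struct ToyIndexInfo
{
	bool		concurrent;
	bool		unique;
	bool		system_catalog;
} ToyIndexInfo;

/* Hypothetical scan state; snapshot_id models "which snapshot is in use". */
typedef struct ToyScan
{
	bool		reset_snapshots;
	uint32_t	snapshot_id;
	uint32_t	resets;
} ToyScan;

/* Same condition as in the serial branch of the patch. */
static bool
toy_should_reset_snapshots(const ToyIndexInfo *info)
{
	return info->concurrent && !info->unique && !info->system_catalog;
}

static void
toy_begin_scan(ToyScan *scan, const ToyIndexInfo *info)
{
	scan->reset_snapshots = toy_should_reset_snapshots(info);
	scan->snapshot_id = 1;
	scan->resets = 0;
}

/*
 * Called when the scan moves to a new page, before any tuple on that page
 * is returned, so the snapshot is never swapped while a tuple is held.
 */
static void
toy_fetch_page(ToyScan *scan, uint32_t blkno)
{
	assert(scan->snapshot_id > 0);
	if (scan->reset_snapshots &&
		blkno % TOY_RESET_SNAPSHOT_EACH_N_PAGE == 0)
	{
		scan->snapshot_id++;	/* models pop + unregister + new snapshot */
		scan->resets++;
	}
}

/* Walk nblocks pages in order and report how many resets happened. */
static uint32_t
toy_scan_pages(ToyScan *scan, uint32_t nblocks)
{
	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
		toy_fetch_page(scan, blkno);
	return scan->resets;
}
```

In this toy model a scan over 200 pages of a concurrent, non-unique index
swaps its snapshot at blocks 0, 64, 128 and 192, while a unique index keeps
one snapshot for the whole scan.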

Best regards,
Mikhail.


Attachments:

  [text/x-patch] v1-0001-Allow-advancing-xmin-during-non-unique-non-parall.patch (31.8K, 3-v1-0001-Allow-advancing-xmin-during-non-unique-non-parall.patch)
  download | inline diff:
From f0ad209453b645728570a1f57b364517bcfdf734 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 12 Nov 2024 13:09:29 +0100
Subject: [PATCH v1] Allow advancing xmin during non-unique, non-parallel
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.

Author: Michail Nikolaev
Reviewed-by: [Reviewers' Names]
Discussion: https://postgr.es/m/CANtu0oiLc-%2B7h9zfzOVy2cv2UuYk_5MUReVLnVbOay6OgD_KGg%40mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |   4 +
 src/backend/access/heap/heapam.c              |  37 +++++++
 src/backend/access/heap/heapam_handler.c      |  45 ++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |   4 +
 src/backend/catalog/index.c                   |  30 +++++-
 src/backend/commands/indexcmds.c              |  10 --
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  27 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 102 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  82 ++++++++++++++
 15 files changed, 332 insertions(+), 28 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 8b82797c10..23c138db0a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4..ff7cc07df9 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c0b978119a..94c086073e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2430,8 +2430,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	else
 		querylen = 0;			/* keep compiler quiet */
 
+	if (IsMVCCSnapshot(snapshot))
+		PushActiveSnapshot(snapshot);
 	/* Everyone's had a chance to ask for space, so now create the DSM */
 	InitializeParallelDSM(pcxt);
+	if (IsMVCCSnapshot(snapshot))
+		PopActiveSnapshot();
 
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dc..21a2515de3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,28 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	Assert(ActiveSnapshotSet());
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	PopActiveSnapshot();
+	UnregisterSnapshot(sscan->rs_snapshot);
+	sscan->rs_snapshot = InvalidSnapshot;
+	InvalidateCatalogSnapshotConditionally();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	sscan->rs_snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(sscan->rs_snapshot);
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -607,7 +630,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1233,6 +1262,14 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT);
+		Assert(ActiveSnapshotSet());
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		PopActiveSnapshot();
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c..5a1d0a9d36 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1244,24 +1243,40 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
+			/*
+			 * A unique index needs a consistent snapshot for the whole scan.
+			 * A parallel scan needs additional infrastructure to support
+			 * SO_RESET_SNAPSHOT, which is not yet ready.
+			 */
+			reset_snapshots = indexInfo->ii_Concurrent &&
+							  !indexInfo->ii_Unique &&
+							  !is_system_catalog; /* just in case */
+			Assert(!ActiveSnapshotSet());
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 */
 			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			PushActiveSnapshot(snapshot);
+			/* In case of SO_RESET_SNAPSHOT snapshots are cleared by table_endscan. */
+			need_unregister_snapshot = need_pop_active_snapshot = !reset_snapshots;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1290,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1306,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1748,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1822,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 60c61039d6..777df91972 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -461,7 +461,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index fb9a05f7af..e7ccefb133 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1485,8 +1485,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	else
 		querylen = 0;			/* keep compiler quiet */
 
+	if (IsMVCCSnapshot(snapshot))
+		PushActiveSnapshot(snapshot);
 	/* Everyone's had a chance to ask for space, so now create the DSM */
 	InitializeParallelDSM(pcxt);
+	if (IsMVCCSnapshot(snapshot))
+		PopActiveSnapshot();
 
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f9bb721c5f..3aa500072c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be registered every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2f652463e3..df5873e124 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1671,15 +1671,9 @@ DefineIndex(Oid tableId,
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4076,9 +4070,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4093,7 +4084,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1f78dc3d53..6b75c14c69 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6890,6 +6891,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6945,6 +6947,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7002,6 +7009,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93c..dc7c766661 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,18 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshots periodically? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped and
+	 * unregistered, the catalog snapshot is invalidated, and the latest
+	 * snapshot is registered and pushed as active.
+	 *
+	 * At the end of the scan the snapshot is popped and unregistered too.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +948,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +957,13 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		Assert(ActiveSnapshotSet());
+		Assert(GetActiveSnapshot() == snapshot);
+		flags |= (SO_RESET_SNAPSHOT | SO_TEMP_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1779,6 +1800,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots are then changed
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58..2225cd0bf8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 0000000000..4cfbbb0592
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f1900115..44cc028e82 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 0000000000..4fef5a4743
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
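
For readers following the tableam.h hunk above, the flag handling in
table_beginscan_strat() can be sketched in isolation. This is only an
illustration, not the patch's code: the DEMO_* names are hypothetical, and
only the bit positions of SO_NEED_TUPLES (1 << 10) and SO_RESET_SNAPSHOT
(1 << 11) are taken from the diff; the other bit values are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for ScanOptions bits.  Only bits 10 and 11 are
 * taken from the patch; the remaining values are assumed for illustration.
 */
enum
{
	DEMO_SO_TYPE_SEQSCAN = 1 << 0,
	DEMO_SO_ALLOW_STRAT = 1 << 6,
	DEMO_SO_ALLOW_SYNC = 1 << 7,
	DEMO_SO_ALLOW_PAGEMODE = 1 << 8,
	DEMO_SO_TEMP_SNAPSHOT = 1 << 9,
	DEMO_SO_NEED_TUPLES = 1 << 10,
	DEMO_SO_RESET_SNAPSHOT = 1 << 11
};

/*
 * Mirror of the flag computation in the patched table_beginscan_strat():
 * a resetting scan is always also marked as using a temporary snapshot.
 */
static uint32_t
demo_beginscan_strat_flags(bool allow_strat, bool allow_sync,
						   bool reset_snapshot)
{
	uint32_t	flags = DEMO_SO_TYPE_SEQSCAN | DEMO_SO_ALLOW_PAGEMODE;

	if (allow_strat)
		flags |= DEMO_SO_ALLOW_STRAT;
	if (allow_sync)
		flags |= DEMO_SO_ALLOW_SYNC;
	if (reset_snapshot)
		flags |= (DEMO_SO_RESET_SNAPSHOT | DEMO_SO_TEMP_SNAPSHOT);

	return flags;
}
```

Callers that pass reset_snapshot = false, such as IndexCheckExclusion() in
the hunk above, get the same flags as before the patch.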




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-12-02 01:39  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-12-02 01:39 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

Added support for parallel builds (snapshots are reset in the first phase).
The next step is support for unique indexes.

Best regards,
Mikhail.


From fc79ec8084837e1792441b1dae1594986dba0caa Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v2 4/4] Allow snapshot resets during parallel concurrent index
 builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 43 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 +++--
 src/backend/access/nbtree/nbtsort.c           | 38 ++++++++++++--
 src/backend/access/table/tableam.c            | 37 ++++++++++++--
 src/backend/access/transam/parallel.c         | 50 +++++++++++++++++--
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 ++--
 .../expected/cic_reset_snapshots.out          | 23 ++++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 12 files changed, 178 insertions(+), 56 deletions(-)
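
The snapshot choice described in the commit message can be summarized as a
small decision function. This is a sketch under stated assumptions, not code
from the patch: the enum and function names are hypothetical, and the logic
follows the nbtree case (BRIN has no unique indexes, so there every
concurrent build resets).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three cases; not PostgreSQL identifiers. */
typedef enum BuildSnapshotMode
{
	BUILD_SNAPSHOT_ANY,			/* normal build: scan with SnapshotAny */
	BUILD_SNAPSHOT_MVCC_FIXED,	/* concurrent unique: one MVCC snapshot */
	BUILD_SNAPSHOT_MVCC_RESET	/* concurrent non-unique: periodic resets */
} BuildSnapshotMode;

/* Pick the snapshot handling for a parallel index build. */
static BuildSnapshotMode
choose_build_snapshot_mode(bool isconcurrent, bool isunique)
{
	if (!isconcurrent)
		return BUILD_SNAPSHOT_ANY;
	if (isunique)
		return BUILD_SNAPSHOT_MVCC_FIXED;
	return BUILD_SNAPSHOT_MVCC_RESET;
}

/*
 * The leader has to wait for workers to restore their initial snapshot
 * only when it takes part in the scan and resets its own snapshot.
 */
static bool
leader_must_wait_for_workers(BuildSnapshotMode mode, bool leaderparticipates)
{
	return mode == BUILD_SNAPSHOT_MVCC_RESET && leaderparticipates;
}
```

Only the last mode lets the xmin horizon advance during the scan; the unique
case keeps a single snapshot for the whole build, as before.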

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically. If the leader
+	 * is going to reset its own active snapshot as well, we need to wait
+	 * until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots may be reset periodically. If the
+	 * leader is going to reset its own active snapshot as well, we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets,
+	 * mark the scan accordingly and use the active
+	 * snapshot as the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate dynamic shared memory for the per-worker flags that report
+		 * whether each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3a7357a050d..148e1982cad 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -291,14 +291,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In the case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is
+ * applied to the scan. The snapshot is then changed on the fly, which allows
+ * the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 4cfbbb05923..49ef68d9071 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,27 +78,40 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 4fef5a47431..5d1c31493f0 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -79,4 +82,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0


From 9432da61d7640457a67cc5ac8ecd0b1c6be132e1 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v2 1/4] this is https://commitfest.postgresql.org/50/5160/
 merged in single commit. it is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 ++++++++-
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++---
 src/backend/utils/time/snapmgr.c              |   2 +
 src/test/modules/injection_points/Makefile    |   7 +-
 .../expected/index_concurrently_upsert.out    |  80 ++++++
 .../index_concurrently_upsert_predicate.out   |  80 ++++++
 .../expected/reindex_concurrently_upsert.out  | 238 ++++++++++++++++++
 ...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
 ...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |  11 +
 .../specs/index_concurrently_upsert.spec      |  68 +++++
 .../index_concurrently_upsert_predicate.spec  |  70 ++++++
 .../specs/reindex_concurrently_upsert.spec    |  86 +++++++
 ...dex_concurrently_upsert_on_constraint.spec |  86 +++++++
 ...index_concurrently_upsert_partitioned.spec |  88 +++++++
 18 files changed, 1505 insertions(+), 50 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Not supported for exclusion constraints. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e439..5ffef4595e2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -713,12 +713,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -753,8 +755,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -766,30 +768,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -812,7 +860,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -832,27 +886,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE", skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -872,7 +922,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -880,6 +930,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -917,27 +971,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the
+		 * same set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference is used, its
+		 * predicate must be implied by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint is used, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -945,7 +1007,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f20..3a7357a050d 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -426,6 +427,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+			reindex_concurrently_upsert \
+			index_concurrently_upsert \
+			reindex_concurrently_upsert_partitioned \
+			reindex_concurrently_upsert_on_constraint \
+			index_concurrently_upsert_predicate
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'reindex_concurrently_upsert',
+      'index_concurrently_upsert',
+      'reindex_concurrently_upsert_partitioned',
+      'reindex_concurrently_upsert_on_constraint',
+      'index_concurrently_upsert_predicate',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
+    # We wait for all snapshots, so avoid parallel test execution
+    'runningcheck-parallel': false,
   },
   'tap': {
     'env': {
@@ -53,5 +62,7 @@ tests += {
     'tests': [
       't/001_stats.pl',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
   },
 }
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+	CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX concurrent primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX concurrent primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX concurrent primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+	CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+		FOR VALUES FROM (0) TO (10000)
+		WITH (parallel_workers = 0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
-- 
2.43.0


From c8e63c35e9ac09b71d53ddc4e5d4dd2b1ec31cb6 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v2 3/4] Allow advancing xmin during non-unique, non-parallel 
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only the first scan of the heap: the second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
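
The reset cadence described above can be illustrated with a minimal standalone sketch. It is not the patch's code: every `sketch_`/`SKETCH_` name is hypothetical, the snapshot is modelled as a plain counter, and only the "flag set and block number divisible by N" check mirrors what the paragraph describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL definitions. */
#define SKETCH_SO_RESET_SNAPSHOT (1u << 0)
#define SKETCH_RESET_EACH_N_PAGE 64u

typedef struct SketchScan
{
	uint32_t	flags;			/* scan options */
	uint32_t	snapshot_id;	/* stands in for the active snapshot */
	uint32_t	resets;			/* how many times the snapshot was replaced */
} SketchScan;

/* Replace the scan's snapshot with a fresh one (modelled as a new id). */
void
sketch_reset_snapshot(SketchScan *scan)
{
	scan->snapshot_id++;
	scan->resets++;
}

/* Called as each block is fetched: reset only if the option is set. */
void
sketch_on_block(SketchScan *scan, uint32_t blkno)
{
	if ((scan->flags & SKETCH_SO_RESET_SNAPSHOT) &&
		(blkno % SKETCH_RESET_EACH_N_PAGE == 0))
		sketch_reset_snapshot(scan);
}

/* Scan blocks 0..nblocks-1 and return the number of snapshot resets. */
uint32_t
sketch_scan(uint32_t flags, uint32_t nblocks)
{
	SketchScan	scan = {flags, 0, 0};

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
		sketch_on_block(&scan, blkno);

	/* A reset can happen at most once per block. */
	assert(scan.resets <= nblocks);
	return scan.resets;
}
```

Under these assumptions a 200-block scan with the option set replaces its snapshot at blocks 0, 64, 128 and 192, while a scan without the option never does. In the real scan each replacement is what lets the backend stop holding back the xmin horizon.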

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  14 +++
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  14 +++
 src/backend/catalog/index.c                   |  30 +++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 102 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  82 ++++++++++++++
 15 files changed, 375 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* remember the active snapshot, since pushing may have copied it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 60c61039d66..777df91972e 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -461,7 +461,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1c3a9e06d37..f581a743aae 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One reason for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible to one snapshot or to a
+	 * series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762ec..d9de16af81d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6942,6 +6943,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6997,6 +6999,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7054,6 +7061,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a fresh snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots are then replaced on
+ * the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..4cfbbb05923
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..4fef5a47431
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0


From 53cfcf3dc0effd2b1a41195d01207f46bac6df86 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v2 2/4] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck bt_index_parent_check
* Exercising parallel worker configurations

The tests perform intensive concurrent modifications via pgbench while
executing index operations to stress test index build infrastructure.
Test cases cover:
- Regular and unique indexes
- Indexes with stable and immutable predicates
- Multi-column indexes with various combinations
- Different parallel worker configurations

Two new test files added:
- t/006_concurrently.pl: General concurrent index operation tests
- t/007_concurrently_unique.pl: Focused testing of unique indexes

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build                |   2 +
 src/bin/pg_amcheck/t/006_concurrently.pl      | 315 ++++++++++++++++++
 .../pg_amcheck/t/007_concurrently_unique.pl   | 239 +++++++++++++
 3 files changed, 556 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
 create mode 100644 src/bin/pg_amcheck/t/007_concurrently_unique.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..b4e14a15ef3 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,8 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_concurrently.pl',
+      't/007_concurrently_unique.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 00000000000..c0f9e9557bf
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,315 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+
+use threads;
+use Test::More;
+use Test::Builder;
+
+
+eval {
+	require IPC::SysV;
+	IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	$node->pgbench(
+		'--no-vacuum --client=10 --transactions=1000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			'001_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			'002_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			# Ensure some HOT updates happen
+			'003_pgbench_concurrent_transaction_updates' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  )
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	my $pg_bench_fork_flag;
+	while (1) {
+		shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);
+		last if $pg_bench_fork_flag eq "stop";
+	}
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr, $n, $stderr_saved);
+		$n = 0;
+
+		$node->psql('postgres', q(CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+                                  LANGUAGE plpgsql AS $$
+                                  BEGIN
+                                    EXECUTE 'SELECT txid_current()';
+                                    RETURN true;
+                                  END; $$;));
+
+		$node->psql('postgres', q(CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+                                  LANGUAGE plpgsql AS $$
+                                  BEGIN
+                                    RETURN MOD($1, 2) = 0;
+                                  END; $$;));
+		while (1)
+		{
+
+			if (int(rand(2)) == 0) {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=1);));
+			} else {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+			}
+			is($result, '0', 'ALTER TABLE is correct');
+
+			if (1)
+			{
+				($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+				is($result, '0', 'REINDEX is correct');
+
+				if ($result) {
+					diag($stderr);
+					BAIL_OUT($stderr);
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_check is correct');
+				if ($result)
+				{
+					diag($stderr);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#reindex:)' . $n++);
+				}
+			}
+
+			if (1)
+			{
+				my $variant = int(rand(7));
+				my $sql;
+				if ($variant == 0) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at););
+				} elsif ($variant == 1) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable(););
+				} elsif ($variant == 2) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;);
+				} elsif ($variant == 3) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i););
+				} elsif ($variant == 4) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i)););
+				} elsif ($variant == 5) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i););
+				} elsif ($variant == 6) {
+					$sql = q(CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+				} else { diag("#wrong variant"); }
+
+				diag('#' . $sql);
+				($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+				is($result, '0', 'CREATE INDEX is correct');
+				$stderr_saved = $stderr;
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_check for new index is correct');
+				if ($result)
+				{
+					diag($stderr);
+					diag($stderr_saved);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#create:)' . $n++);
+				}
+
+				if (1)
+				{
+					($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+					is($result, '0', 'REINDEX 2 is correct');
+					if ($result) {
+						diag($stderr);
+						BAIL_OUT($stderr);
+					}
+
+					($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+					is($result, '0', 'bt_index_check 2 is correct');
+					if ($result)
+					{
+						diag($stderr);
+						BAIL_OUT($stderr);
+					} else {
+						diag('#reindex2:)' . $n++);
+					}
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+				is($result, '0', 'DROP INDEX is correct');
+			}
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+        $psql{stdin} .= "\\q\n";
+        $psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+
+	shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# For each query we run, we'll restart the timeout.  Otherwise the timeout
+	# would apply to the whole test script, and would need to be set very high
+	# to survive when running under Valgrind.
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			diag("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
diff --git a/src/bin/pg_amcheck/t/007_concurrently_unique.pl b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
new file mode 100644
index 00000000000..22cd3b4bf2b
--- /dev/null
+++ b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
@@ -0,0 +1,239 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use threads;
+use Test::More;
+use Test::Builder;
+
+eval {
+	require IPC::SysV;
+	IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 128MB');
+$node->append_conf('postgresql.conf', 'shared_buffers = 256MB');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i, updated_at)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	# $node->psql('postgres', q(INSERT INTO tbl SELECT i,0,0,0,now() FROM generate_series(1, 1000) s(i);));
+	# while [ $? -eq 0 ]; do make -C src/bin/pg_amcheck/ check PROVE_TESTS='t/007_*' ; done
+
+	$node->pgbench(
+		'--no-vacuum --client=40 --exit-on-abort --transactions=1000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			# Ensure some HOT updates happen
+			'001_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+			),
+			'002_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*100,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+			'003_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+			'004_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	my $pg_bench_fork_flag;
+	while (1) {
+		shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);
+		last if $pg_bench_fork_flag eq "stop";
+	}
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr, $n, $stderr_saved);
+
+#		ok(send_query_and_wait(\%psql, q[SELECT pg_sleep(10);], qr/^.*$/m), 'SELECT');
+
+		while (1)
+		{
+
+			if (int(rand(2)) == 0) {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+			} else {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+			}
+			is($result, '0', 'ALTER TABLE is correct');
+
+
+			if (1)
+			{
+				my $sql = q(select pg_sleep(0); CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+
+				($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+				is($result, '0', 'CREATE INDEX is correct');
+				$stderr_saved = $stderr;
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_check for new index is correct');
+				if ($result)
+				{
+					diag($stderr);
+					diag($stderr_saved);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#create:)' . $n++);
+				}
+
+				if (1)
+				{
+					($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+					is($result, '0', 'REINDEX 2 is correct');
+					if ($result) {
+						diag($stderr);
+						BAIL_OUT($stderr);
+					}
+
+					($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+					is($result, '0', 'bt_index_check 2 is correct');
+					if ($result)
+					{
+						diag($stderr);
+						BAIL_OUT($stderr);
+					} else {
+						diag('#reindex2:)' . $n++);
+					}
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+				is($result, '0', 'DROP INDEX is correct');
+			}
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+		$psql{stdin} .= "\\q\n";
+		$psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+
+	shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# For each query we run, we'll restart the timeout.  Otherwise the timeout
+	# would apply to the whole test script, and would need to be set very high
+	# to survive when running under Valgrind.
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			diag("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
-- 
2.43.0



Attachments:

  [text/plain] v2-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (29.2K, 3-v2-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch)
  download | inline diff:
From fc79ec8084837e1792441b1dae1594986dba0caa Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v2 4/4] Allow snapshot resets during parallel concurrent index
 builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 43 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 +++--
 src/backend/access/nbtree/nbtsort.c           | 38 ++++++++++++--
 src/backend/access/table/tableam.c            | 37 ++++++++++++--
 src/backend/access/transam/parallel.c         | 50 +++++++++++++++++--
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 ++--
 .../expected/cic_reset_snapshots.out          | 23 ++++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 12 files changed, 178 insertions(+), 56 deletions(-)
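
Not part of the patch: a minimal standalone sketch of the attach condition
described in the commit message, under a simplified model. WorkerSlot,
count_known_attached and leader_may_proceed are hypothetical names used only
for illustration; they do not exist in PostgreSQL. The real logic lives in
WaitForParallelWorkersToAttach() and works on shared memory and error queues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the per-worker state the leader checks. */
typedef struct WorkerSlot
{
	bool		attached;			/* worker attached to its error queue */
	bool		snapshot_restored;	/* worker restored its initial snapshot */
} WorkerSlot;

/*
 * Count the workers the leader may treat as "known attached".  When
 * wait_for_snapshot is set, an attached worker only counts once it has also
 * reported that its snapshot is restored.
 */
size_t
count_known_attached(const WorkerSlot *slots, size_t nworkers,
					 bool wait_for_snapshot)
{
	size_t		n = 0;

	assert(slots != NULL || nworkers == 0);

	for (size_t i = 0; i < nworkers; i++)
	{
		if (!slots[i].attached)
			continue;
		if (!wait_for_snapshot || slots[i].snapshot_restored)
			n++;
	}
	return n;
}

/* The leader may stop waiting once every worker counts as attached. */
bool
leader_may_proceed(const WorkerSlot *slots, size_t nworkers,
				   bool wait_for_snapshot)
{
	return count_known_attached(slots, nworkers, wait_for_snapshot) == nworkers;
}
```

With wait_for_snapshot set, a worker that has attached but not yet restored
its snapshot keeps the leader waiting, so the leader cannot reset its own
active snapshot too early.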

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * We then index whatever's live according to the active snapshot, which
+	 * is reset periodically during the scan.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically.  If the leader
+	 * is going to reset its own active snapshot as well, we must first wait
+	 * until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that, while that snapshot is
+	 * reset every so often (in the case of a non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * the case of a non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * When snapshots are reset periodically (concurrent non-unique build) and
+	 * the leader participates, it resets its own active snapshot as well, so
+	 * we must first wait until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan descriptor. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan accordingly
+	 * and use the active snapshot as the initial state.  Otherwise, restore
+	 * the serialized snapshot unless the scan uses SnapshotAny.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for per-worker flags that tell the leader
+		 * when each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Reset the per-worker snapshot-restored flags. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * If wait_for_snapshot is true, a worker counts as attached only after it has
+ * also restored its snapshot.  Periodic snapshot resets need this so that all
+ * workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3a7357a050d..148e1982cad 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -291,14 +291,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan.  This causes snapshots to be changed on the fly, allowing the
+ * xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 4cfbbb05923..49ef68d9071 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,27 +78,40 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 4fef5a47431..5d1c31493f0 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -79,4 +82,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [text/plain] v2-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (61.5K, 4-v2-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch)
From 9432da61d7640457a67cc5ac8ecd0b1c6be132e1 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v2 1/4] This is https://commitfest.postgresql.org/50/5160/
 merged into a single commit. It is required for the stability of stress tests.

---
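As a rough illustration of the executor-side check this patch adds
(IsIndexCompatibleAsArbiter in execPartition.c), here is a minimal standalone
sketch. It is not the patch's code: ToyIndex, same_text and
toy_index_compatible_as_arbiter are hypothetical names, the struct fields only
model the catalog data the real function reads, and plain string equality
stands in for the list_difference() comparison of expressions and predicates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_KEYS 8

/* Hypothetical, simplified stand-in for the index metadata the patch inspects. */
typedef struct ToyIndex
{
	bool		is_unique;			/* models ii_Unique */
	bool		has_exclusion_ops;	/* models ii_ExclusionOps != NULL */
	int			nkeyatts;			/* models indnkeyatts */
	int			keyattnos[MAX_KEYS];	/* models indkey.values[] */
	const char *expressions;		/* models index expressions, NULL if none */
	const char *predicate;			/* models partial-index predicate, NULL if none */
} ToyIndex;

/* NULL-safe string equality; two NULLs compare equal. */
static bool
same_text(const char *a, const char *b)
{
	if (a == NULL || b == NULL)
		return a == b;
	return strcmp(a, b) == 0;
}

/*
 * Returns true if "candidate" could act as an arbiter alongside "arbiter":
 * same uniqueness, no exclusion operators on either side, identical key
 * columns in the same order, and identical expressions and predicate.
 */
static bool
toy_index_compatible_as_arbiter(const ToyIndex *arbiter, const ToyIndex *candidate)
{
	assert(arbiter->nkeyatts <= MAX_KEYS && candidate->nkeyatts <= MAX_KEYS);

	if (arbiter->is_unique != candidate->is_unique)
		return false;
	if (arbiter->has_exclusion_ops || candidate->has_exclusion_ops)
		return false;
	if (arbiter->nkeyatts != candidate->nkeyatts)
		return false;

	for (int i = 0; i < candidate->nkeyatts; i++)
	{
		if (arbiter->keyattnos[i] != candidate->keyattnos[i])
			return false;
	}

	if (!same_text(arbiter->expressions, candidate->expressions))
		return false;
	return same_text(arbiter->predicate, candidate->predicate);
}
```

The case this models is REINDEX CONCURRENTLY on a partition's index: the
transient copy has no partition ancestor, but it matches the existing arbiter
column for column, so it would be accepted as an additional arbiter.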
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 ++++++++-
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++---
 src/backend/utils/time/snapmgr.c              |   2 +
 src/test/modules/injection_points/Makefile    |   7 +-
 .../expected/index_concurrently_upsert.out    |  80 ++++++
 .../index_concurrently_upsert_predicate.out   |  80 ++++++
 .../expected/reindex_concurrently_upsert.out  | 238 ++++++++++++++++++
 ...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
 ...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |  11 +
 .../specs/index_concurrently_upsert.spec      |  68 +++++
 .../index_concurrently_upsert_predicate.spec  |  70 ++++++
 .../specs/reindex_concurrently_upsert.spec    |  86 +++++++
 ...dex_concurrently_upsert_on_constraint.spec |  86 +++++++
 ...index_concurrently_upsert_partitioned.spec |  88 +++++++
 18 files changed, 1505 insertions(+), 50 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec

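The planner-side rule in the plancat.c hunk of this patch can be summarized
as: every matching index that is indisready becomes an arbiter, but planning
still fails unless at least one of them is also indisvalid. A minimal sketch of
that rule follows, assuming the matching of columns, expressions and predicates
has already been reduced to a single flag. ToyCandidate and toy_infer_arbiters
are hypothetical names, and a return value of -1 stands in for the ereport(ERROR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one index on the target relation. */
typedef struct ToyCandidate
{
	unsigned	oid;			/* models indexrelid */
	bool		indisready;		/* index receives inserts */
	bool		indisvalid;		/* index is usable for queries */
	bool		matches;		/* columns/expressions/predicate already matched */
} ToyCandidate;

/*
 * Collects arbiter OIDs into "out" (which must have room for n entries).
 * Returns the number of arbiters, or -1 where the real code would raise
 * "there is no unique or exclusion constraint matching the ON CONFLICT
 * specification".
 */
static int
toy_infer_arbiters(const ToyCandidate *idx, size_t n, unsigned *out)
{
	int			nresults = 0;
	bool		found_valid = false;

	assert(idx != NULL && out != NULL);

	for (size_t i = 0; i < n; i++)
	{
		/* Not-yet-ready indexes are skipped; ready-but-invalid ones are kept. */
		if (!idx[i].indisready || !idx[i].matches)
			continue;

		out[nresults++] = idx[i].oid;
		found_valid |= idx[i].indisvalid;
	}

	/* At least one valid index must be among the chosen arbiters. */
	if (nresults == 0 || !found_valid)
		return -1;
	return nresults;
}
```

With one valid index and one ready-but-invalid duplicate, both OIDs are
returned, so concurrent transactions planned before and after the duplicate
becomes valid end up with the same arbiter set.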
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks whether the given index is interchangeable with the
+ * 		provided arbiter index for the purposes of INSERT ON CONFLICT,
+ * 		by comparing uniqueness, key columns, expressions and predicates.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of unequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we need to account for the additional arbiter indexes too.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e439..5ffef4595e2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -713,12 +713,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -753,8 +755,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -766,30 +768,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed order of locks.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -812,7 +860,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -832,27 +886,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -872,7 +922,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -880,6 +930,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -917,27 +971,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -945,7 +1007,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f20..3a7357a050d 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -426,6 +427,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+			reindex_concurrently_upsert \
+			index_concurrently_upsert \
+			reindex_concurrently_upsert_partitioned \
+			reindex_concurrently_upsert_on_constraint \
+			index_concurrently_upsert_predicate
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'reindex_concurrently_upsert',
+      'index_concurrently_upsert',
+      'reindex_concurrently_upsert_partitioned',
+      'reindex_concurrently_upsert_on_constraint',
+      'index_concurrently_upsert_predicate',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
+    # We wait for all snapshots, so avoid parallel test execution
+    'runningcheck-parallel': false,
   },
   'tap': {
     'env': {
@@ -53,5 +62,7 @@ tests += {
     'tests': [
       't/001_stats.pl',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
   },
 }
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+	CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+	CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+		FOR VALUES FROM (0) TO (10000)
+		WITH (parallel_workers = 0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
-- 
2.43.0



  [text/plain] v2-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (35.8K, 5-v2-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch)
  download | inline diff:
From c8e63c35e9ac09b71d53ddc4e5d4dd2b1ec31cb6 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v2 3/4] Allow advancing xmin during non-unique, non-parallel 
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
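
Note (not part of the commit message): the periodic reset described above
can be illustrated with a toy model. This is only a sketch under stated
assumptions, not PostgreSQL code: ToyScan, toy_take_snapshot and toy_scan are
hypothetical names, a "snapshot" is reduced to a single xmin value, and one
concurrent transaction is assumed to start per scanned page. The only thing
taken from the patch is the reset interval of 64 pages and the
"block number modulo N" trigger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors SO_RESET_SNAPSHOT_EACH_N_PAGE from the patch. */
#define RESET_EACH_N_PAGE 64

/* Hypothetical stand-in for scan state: only the xmin of the snapshot. */
typedef struct ToyScan
{
	uint32_t	snapshot_xmin;	/* xmin held back by the current snapshot */
	uint32_t	resets;			/* how many times the snapshot was replaced */
} ToyScan;

/* Hypothetical: a fresh snapshot's xmin is simply the next unassigned xid. */
static uint32_t
toy_take_snapshot(uint32_t next_xid)
{
	return next_xid;
}

/*
 * Scan nblocks pages. One concurrent transaction is assumed to start per
 * page, so next_xid grows by one on every iteration.
 */
static ToyScan
toy_scan(uint32_t nblocks, bool reset_snapshots, uint32_t start_xid)
{
	ToyScan		scan = {toy_take_snapshot(start_xid), 0};
	uint32_t	next_xid = start_xid;

	assert(RESET_EACH_N_PAGE > 0);

	for (uint32_t blk = 0; blk < nblocks; blk++)
	{
		/* Same trigger shape as the patch: block number modulo N. */
		if (reset_snapshots && blk % RESET_EACH_N_PAGE == 0)
		{
			scan.snapshot_xmin = toy_take_snapshot(next_xid);
			scan.resets++;
		}
		next_xid++;				/* a concurrent transaction started */
	}
	return scan;
}
```

In this model, toy_scan(200, true, 1000) replaces the snapshot at blocks 0,
64, 128 and 192, so the xmin it holds back ends at 1192, while
toy_scan(200, false, 1000) keeps xmin at 1000 for the whole scan.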
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  14 +++
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  14 +++
 src/backend/catalog/index.c                   |  30 +++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 102 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  82 ++++++++++++++
 15 files changed, 375 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to work with
+	 * SO_RESET_SNAPSHOT, which is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, since pushing may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 60c61039d66..777df91972e 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -461,7 +461,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1c3a9e06d37..f581a743aae 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible to a single snapshot or to
+	 * periodically refreshed snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762ec..d9de16af81d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6942,6 +6943,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6997,6 +6999,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7054,6 +7061,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped,
+	 * the catalog snapshot is invalidated, and the latest snapshot is pushed.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to let the xmin horizon advance.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique, non-parallel concurrent index build,
+ * SO_RESET_SNAPSHOT is applied to the scan. The snapshot is then replaced
+ * on the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..4cfbbb05923
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..4fef5a47431
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [text/plain] v2-0002-Add-stress-tests-for-concurrent-index-operations.patch (20.3K, 6-v2-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 53cfcf3dc0effd2b1a41195d01207f46bac6df86 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v2 2/4] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck bt_index_parent_check
* Exercising parallel worker configurations

The tests perform intensive concurrent modifications via pgbench while
executing index operations to stress test index build infrastructure.
Test cases cover:
- Regular and unique indexes
- Indexes with stable and immutable predicates
- Multi-column indexes with various combinations
- Different parallel worker configurations

Two new test files added:
- t/006_concurrently.pl: General concurrent index operation tests
- t/007_concurrently_unique.pl: Focused testing of unique indexes

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build                |   2 +
 src/bin/pg_amcheck/t/006_concurrently.pl      | 315 ++++++++++++++++++
 .../pg_amcheck/t/007_concurrently_unique.pl   | 239 +++++++++++++
 3 files changed, 556 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
 create mode 100644 src/bin/pg_amcheck/t/007_concurrently_unique.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..b4e14a15ef3 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,8 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_concurrently.pl',
+      't/007_concurrently_unique.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 00000000000..c0f9e9557bf
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,315 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+
+use threads;
+use Test::More;
+use Test::Builder;
+
+
+eval {
+	require IPC::SysV;
+	IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	$node->pgbench(
+		'--no-vacuum --client=10 --transactions=1000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			'001_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			'002_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			# Ensure some HOT updates happen
+			'003_pgbench_concurrent_transaction_updates' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  )
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	my $pg_bench_fork_flag;
+	while (1) {
+		shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);	# 0.1 s; builtin sleep() truncates 0.1 to 0 and busy-loops
+		last if $pg_bench_fork_flag eq "stop";
+	}
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr, $n, $stderr_saved);
+		$n = 0;
+
+		$node->psql('postgres', q(CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+                                  LANGUAGE plpgsql AS $$
+                                  BEGIN
+                                    EXECUTE 'SELECT txid_current()';
+                                    RETURN true;
+                                  END; $$;));
+
+		$node->psql('postgres', q(CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+                                  LANGUAGE plpgsql AS $$
+                                  BEGIN
+                                    RETURN MOD($1, 2) = 0;
+                                  END; $$;));
+		while (1)
+		{
+
+			if (int(rand(2)) == 0) {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=1);));
+			} else {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+			}
+			is($result, '0', 'ALTER TABLE is correct');
+
+			if (1)
+			{
+				($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+				is($result, '0', 'REINDEX is correct');
+
+				if ($result) {
+					diag($stderr);
+					BAIL_OUT($stderr);
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_check is correct');
+				if ($result)
+				{
+					diag($stderr);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#reindex:)' . $n++);
+				}
+			}
+
+			if (1)
+			{
+				my $variant = int(rand(7));
+				my $sql;
+				if ($variant == 0) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at););
+				} elsif ($variant == 1) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable(););
+				} elsif ($variant == 2) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;);
+				} elsif ($variant == 3) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i););
+				} elsif ($variant == 4) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i)););
+				} elsif ($variant == 5) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i););
+				} elsif ($variant == 6) {
+					$sql = q(CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+				} else { diag("#wrong variant"); }
+
+				diag('#' . $sql);
+				($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+				is($result, '0', 'CREATE INDEX is correct');
+				$stderr_saved = $stderr;
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_check for new index is correct');
+				if ($result)
+				{
+					diag($stderr);
+					diag($stderr_saved);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#create:)' . $n++);
+				}
+
+				if (1)
+				{
+					($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+					is($result, '0', 'REINDEX 2 is correct');
+					if ($result) {
+						diag($stderr);
+						BAIL_OUT($stderr);
+					}
+
+					($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+					is($result, '0', 'bt_index_check 2 is correct');
+					if ($result)
+					{
+						diag($stderr);
+						BAIL_OUT($stderr);
+					} else {
+						diag('#reindex2:)' . $n++);
+					}
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+				is($result, '0', 'DROP INDEX is correct');
+			}
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+		$psql{stdin} .= "\\q\n";
+		$psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+
+	shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# For each query we run, we'll restart the timeout.  Otherwise the timeout
+	# would apply to the whole test script, and would need to be set very high
+	# to survive when running under Valgrind.
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			diag("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
diff --git a/src/bin/pg_amcheck/t/007_concurrently_unique.pl b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
new file mode 100644
index 00000000000..22cd3b4bf2b
--- /dev/null
+++ b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
@@ -0,0 +1,239 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test CREATE UNIQUE INDEX CONCURRENTLY and REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use threads;
+use Test::More;
+use Test::Builder;
+
+eval {
+	require IPC::SysV;
+	IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 128MB');
+$node->append_conf('postgresql.conf', 'shared_buffers = 256MB');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i, updated_at)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	# $node->psql('postgres', q(INSERT INTO tbl SELECT i,0,0,0,now() FROM generate_series(1, 1000) s(i);));
+	# while [ $? -eq 0 ]; do make -C src/bin/pg_amcheck/ check PROVE_TESTS='t/007_*' ; done
+
+	$node->pgbench(
+		'--no-vacuum --client=40 --exit-on-abort --transactions=1000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			# Ensure some HOT updates happen
+			'001_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+			),
+			'002_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*100,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+			'003_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+			'004_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	my $pg_bench_fork_flag;
+	while (1) {
+		shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);	# 0.1 s; builtin sleep() truncates 0.1 to 0 and busy-loops
+		last if $pg_bench_fork_flag eq "stop";
+	}
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr, $n, $stderr_saved);
+
+#		ok(send_query_and_wait(\%psql, q[SELECT pg_sleep(10);], qr/^.*$/m), 'SELECT');
+
+		while (1)
+		{
+
+			if (int(rand(2)) == 0) {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+			} else {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+			}
+			is($result, '0', 'ALTER TABLE is correct');
+
+
+			if (1)
+			{
+				my $sql = q(select pg_sleep(0); CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+
+				($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+				is($result, '0', 'CREATE INDEX is correct');
+				$stderr_saved = $stderr;
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+				is($result, '0', 'bt_index_check for new index is correct');
+				if ($result)
+				{
+					diag($stderr);
+					diag($stderr_saved);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#create:)' . $n++);
+				}
+
+				if (1)
+				{
+					($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+					is($result, '0', 'REINDEX 2 is correct');
+					if ($result) {
+						diag($stderr);
+						BAIL_OUT($stderr);
+					}
+
+					($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_parent_check('idx_2', heapallindexed => true, rootdescend => true, checkunique => true);));
+					is($result, '0', 'bt_index_check 2 is correct');
+					if ($result)
+					{
+						diag($stderr);
+						BAIL_OUT($stderr);
+					} else {
+						diag('#reindex2:)' . $n++);
+					}
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+				is($result, '0', 'DROP INDEX is correct');
+			}
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+		$psql{stdin} .= "\\q\n";
+		$psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+	$child->finalize();
+	$child->summary();
+	$node->stop;
+	done_testing();
+
+	shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# For each query we run, we'll restart the timeout.  Otherwise the timeout
+	# would apply to the whole test script, and would need to be set very high
+	# to survive when running under Valgrind.
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			diag("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-12-09 20:53  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-12-09 20:53 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Matthias!

Added support for unique indexes.

So, now your initial idea about resetting during the first phase appears to
be ready.
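
For illustration only, the uniqueness rule described in the v5-0005 commit
message (dead tuples are ignored; more than one alive tuple with the same key
is a violation) can be sketched in standalone C. The names `SortedEntry` and
`has_live_duplicate` are hypothetical and do not come from the patch, which
works on real index tuples inside tuplesort and `_bt_load`. The sketch also
assumes the input is already sorted by key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one index tuple coming out of the sort. */
typedef struct SortedEntry
{
	int			key;	/* indexed value; entries must be sorted by key */
	bool		alive;	/* false if the heap tuple is already dead */
} SortedEntry;

/*
 * Return true if any key has more than one alive entry.
 * Dead entries are skipped, so a dead and an alive entry with the
 * same key do not count as a uniqueness violation.
 */
static bool
has_live_duplicate(const SortedEntry *entries, size_t n)
{
	bool		have_key = false;
	int			cur_key = 0;
	int			live_in_run = 0;

	for (size_t i = 0; i < n; i++)
	{
		/* Precondition of this sketch: input is sorted by key. */
		assert(i == 0 || entries[i - 1].key <= entries[i].key);

		/* A new key starts a new run of equal-keyed entries. */
		if (!have_key || entries[i].key != cur_key)
		{
			have_key = true;
			cur_key = entries[i].key;
			live_in_run = 0;
		}

		/* Only alive entries count toward the duplicate check. */
		if (entries[i].alive && ++live_in_run > 1)
			return true;
	}
	return false;
}
```

With this rule, a run such as (2, dead), (2, alive) passes, while
(2, alive), (2, dead), (2, alive) is reported as a duplicate.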

The next step is to use a single scan and an auxiliary index for concurrent
index builds.

Also, I have updated the stress tests in accordance with [0].

[0]:
https://www.postgresql.org/message-id/flat/CANtu0ojmVd27fEhfpST7RG2KZvwkX%3DdMyKUqg0KM87FkOSdz8Q%40m...

Best regards,
Mikhail.


Attachments:

  [application/x-patch] v5-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch (32.5K, 3-v5-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch)
  download | inline diff:
From e7d31801aac57f2e0bfc6bfc209be89eb90c75e9 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v5 5/5] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
---
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 173 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   6 +-
 src/backend/utils/sort/tuplesortvariants.c    |  67 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 11 files changed, 242 insertions(+), 75 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e5163609c1..921b806642a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2acbf121745..ac9e5acfc53 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * It is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool holds no group of
+	 * equal keys with multiple alive tuples. That may happen if alive tuples
+	 * are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not done yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	bool		wait_for_snapshot_attach;
 	int			querylen;
 
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case when leader going to reset own active snapshot as well - we need to
 	 * wait until all workers imported initial snapshot.
 	 */
-	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
 
 	if (wait_for_snapshot_attach)
 		WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb45..3d6dda4ace8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4672,7 +4670,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4790,17 +4788,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so that it can be used as a comparison function during
+ * concurrent unique index builds when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls, if provided, is set to true if any compared attribute is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4826,6 +4831,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4845,7 +4852,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4856,7 +4863,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during concurrent
+ * builds of unique indexes.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4865,6 +4873,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls, if provided, is set to true if any compared attribute is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4873,7 +4883,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4890,6 +4901,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f581a743aae..6242b242940 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3292,9 +3292,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check, see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9ee5ea15fd4..ec3769585c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1803,9 +1803,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of concurrent index build SO_RESET_SNAPSHOT is applied for the scan.
+ * That changes snapshots on the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 49ef68d9071..c8e4683ad6d 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/x-patch] v5-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (35.8K, 4-v5-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch)
  download | inline diff:
From 54e755b2d097753f65e14c4aafd5718e0cb457f8 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v5 3/5] Allow advancing xmin during non-unique, non-parallel 
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.
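
The restrictions listed above can be read as a single predicate. The following is a minimal sketch, not code from the patch: BuildOptions and can_reset_snapshots are hypothetical names, and the struct only stands in for the IndexInfo fields involved. In the patch itself the parallel case is excluded by passing the flag only on the serial scan path rather than by an explicit check.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the build properties the decision depends on. */
typedef struct BuildOptions
{
	bool		concurrent;		/* CREATE/REINDEX ... CONCURRENTLY */
	bool		unique;			/* unique index build */
	bool		parallel;		/* parallel workers are used */
	bool		system_catalog; /* heap is a system catalog */
} BuildOptions;

/*
 * Returns true when the first heap scan of the build may run with
 * periodically reset snapshots, following the restrictions listed in the
 * commit message.
 */
static bool
can_reset_snapshots(const BuildOptions *opts)
{
	return opts->concurrent &&
		!opts->unique &&
		!opts->parallel &&
		!opts->system_catalog;
}
```

Only a concurrent, non-unique, serial build on a user table passes; every other combination keeps a single snapshot for the whole scan.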

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
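
The reset cadence can be illustrated in isolation. The sketch below is standalone C and does not call any PostgreSQL API: RESET_EACH_N_PAGE mirrors SO_RESET_SNAPSHOT_EACH_N_PAGE (64 in this patch), while should_reset_snapshot and count_snapshot_resets are hypothetical helpers that only reproduce the modulo test added to heap_fetch_next_buffer, assuming a forward scan that visits every block once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors SO_RESET_SNAPSHOT_EACH_N_PAGE from the patch. */
#define RESET_EACH_N_PAGE 64

/*
 * Hypothetical helper: same condition as the one added to
 * heap_fetch_next_buffer. reset_flag stands for SO_RESET_SNAPSHOT being set.
 */
static bool
should_reset_snapshot(bool reset_flag, uint32_t blkno)
{
	return reset_flag && (blkno % RESET_EACH_N_PAGE == 0);
}

/*
 * Count the resets triggered by a forward scan over blocks 0..nblocks-1.
 * Block 0 satisfies the modulo test, so any non-empty scan with the flag
 * set resets at least once.
 */
static uint32_t
count_snapshot_resets(bool reset_flag, uint32_t nblocks)
{
	uint32_t	resets = 0;

	/* Guard the modulo above against a zero interval. */
	assert(RESET_EACH_N_PAGE > 0);

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
	{
		if (should_reset_snapshot(reset_flag, blkno))
			resets++;
	}
	return resets;
}
```

Under these assumptions a 200-block scan resets at blocks 0, 64, 128 and 192, and a scan without the flag never resets.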

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  14 +++
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  14 +++
 src/backend/catalog/index.c                   |  30 +++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 102 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  82 ++++++++++++++
 15 files changed, 375 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* use the active snapshot pointer, since the push may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 60c61039d66..777df91972e 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -461,7 +461,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1c3a9e06d37..f581a743aae 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One reason for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often.
+ * This allows the xmin horizon to advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible under a single snapshot or
+	 * periodically refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b665a7762ec..d9de16af81d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6942,6 +6943,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6997,6 +6999,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7054,6 +7061,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot must not be registered, so that xmin can advance. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index and a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots are then changed on
+ * the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..4cfbbb05923
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..4fef5a47431
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/x-patch] v5-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (61.5K, 5-v5-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch)
  download | inline diff:
From 9432da61d7640457a67cc5ac8ecd0b1c6be132e1 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v5 1/5] this is https://commitfest.postgresql.org/50/5160/
 merged in single commit. it is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 ++++++++-
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++---
 src/backend/utils/time/snapmgr.c              |   2 +
 src/test/modules/injection_points/Makefile    |   7 +-
 .../expected/index_concurrently_upsert.out    |  80 ++++++
 .../index_concurrently_upsert_predicate.out   |  80 ++++++
 .../expected/reindex_concurrently_upsert.out  | 238 ++++++++++++++++++
 ...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
 ...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |  11 +
 .../specs/index_concurrently_upsert.spec      |  68 +++++
 .../index_concurrently_upsert_predicate.spec  |  70 ++++++
 .../specs/reindex_concurrently_upsert.spec    |  86 +++++++
 ...dex_concurrently_upsert_on_constraint.spec |  86 +++++++
 ...index_concurrently_upsert_partitioned.spec |  88 +++++++
 18 files changed, 1505 insertions(+), 50 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Indexes backing exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 37b0ca2e439..5ffef4595e2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -713,12 +713,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -753,8 +755,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -766,30 +768,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -812,7 +860,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -832,27 +886,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -872,7 +922,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -880,6 +930,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -917,27 +971,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * For conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * For a named constraint, ensure the candidate has the same set of
+		 * expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -945,7 +1007,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required at planning time. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 7d2b34d4f20..3a7357a050d 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -426,6 +427,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+			reindex_concurrently_upsert \
+			index_concurrently_upsert \
+			reindex_concurrently_upsert_partitioned \
+			reindex_concurrently_upsert_on_constraint \
+			index_concurrently_upsert_predicate
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'reindex_concurrently_upsert',
+      'index_concurrently_upsert',
+      'reindex_concurrently_upsert_partitioned',
+      'reindex_concurrently_upsert_on_constraint',
+      'index_concurrently_upsert_predicate',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
+    # We wait for all snapshots, so avoid parallel test execution
+    'runningcheck-parallel': false,
   },
   'tap': {
     'env': {
@@ -53,5 +62,7 @@ tests += {
     'tests': [
       't/001_stats.pl',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
   },
 }
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+	CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the partition's primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+	CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+		FOR VALUES FROM (0) TO (10000)
+		WITH (parallel_workers = 0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
-- 
2.43.0

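The stress tests in the next patch coordinate the forked pgbench process and the index-operation loop through a 4-byte shared-memory flag holding "wait", "done", "fail" or "stop". As a rough sketch of that handshake (not code from the patch: the enum, the function names and the idea of validating writes are assumptions made for illustration), the protocol can be modelled as a small state machine in C:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The four values the tests store in the 4-byte shared-memory flag. */
typedef enum { FLAG_WAIT, FLAG_DONE, FLAG_FAIL, FLAG_STOP, FLAG_UNKNOWN } Flag;

/* Who writes the flag (hypothetical names, for illustration only). */
typedef enum { WRITER_PGBENCH_CHILD, WRITER_INDEX_PARENT } Writer;

/* Map the flag contents onto the enum; only the first 4 bytes are compared,
 * matching the segment size used by the tests. */
Flag
flag_parse(const char *s)
{
	assert(s != NULL);
	if (strncmp(s, "wait", 4) == 0) return FLAG_WAIT;
	if (strncmp(s, "done", 4) == 0) return FLAG_DONE;
	if (strncmp(s, "fail", 4) == 0) return FLAG_FAIL;
	if (strncmp(s, "stop", 4) == 0) return FLAG_STOP;
	return FLAG_UNKNOWN;
}

/* The parent repeats its index operations only while the flag is "wait". */
bool
parent_keeps_looping(Flag f)
{
	return f == FLAG_WAIT;
}

/* After pgbench finishes, the child polls until the parent writes "stop". */
bool
child_keeps_polling(Flag f)
{
	return f != FLAG_STOP;
}

/* Is this a write the scripts perform after fork()?  The initial "wait"
 * is written before the fork and is therefore not modelled here. */
bool
flag_write_allowed(Flag from, Flag to, Writer who)
{
	if (who == WRITER_PGBENCH_CHILD)
		return from == FLAG_WAIT && (to == FLAG_DONE || to == FLAG_FAIL);
	return (from == FLAG_DONE || from == FLAG_FAIL) && to == FLAG_STOP;
}
```

Read this way, the parent's final check that the flag equals "done" is what turns a pgbench failure in the child into a test failure in the parent.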


  [application/x-patch] v5-0002-Add-stress-tests-for-concurrent-index-operations.patch (20.2K, 6-v5-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 836cb845682460d8967dfbf2826f4c237d6be4e1 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v5 2/5] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck bt_index_parent_check
* Exercising parallel worker configurations

The tests perform intensive concurrent modifications via pgbench while
executing index operations to stress test index build infrastructure.
Test cases cover:
- Regular and unique indexes
- Indexes with stable and immutable predicates
- Multi-column indexes with various combinations
- Different parallel worker configurations

Two new test files added:
- t/006_concurrently.pl: General concurrent index operation tests
- t/007_concurrently_unique.pl: Focused testing of unique indexes

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build                |   2 +
 src/bin/pg_amcheck/t/006_concurrently.pl      | 315 ++++++++++++++++++
 .../pg_amcheck/t/007_concurrently_unique.pl   | 239 +++++++++++++
 3 files changed, 556 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl
 create mode 100644 src/bin/pg_amcheck/t/007_concurrently_unique.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..b4e14a15ef3 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,8 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_concurrently.pl',
+      't/007_concurrently_unique.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_concurrently.pl b/src/bin/pg_amcheck/t/006_concurrently.pl
new file mode 100644
index 00000000000..e13a340e777
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_concurrently.pl
@@ -0,0 +1,315 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+
+use threads;
+use Test::More;
+use Test::Builder;
+
+
+eval {
+	require IPC::SysV;
+	IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	$node->pgbench(
+		'--no-vacuum --client=10 --transactions=1000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			'001_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			'002_pgbench_concurrent_transaction_inserts' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  ),
+			# Ensure some HOT updates happen
+			'003_pgbench_concurrent_transaction_updates' => q(
+				BEGIN;
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
+					on conflict(i) do update set updated_at = now();
+				COMMIT;
+			  )
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	my $pg_bench_fork_flag;
+	while (1) {
+		shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);	# builtin sleep() truncates 0.1 to 0 and would busy-loop
+		last if $pg_bench_fork_flag eq "stop";
+	}
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr, $n, $stderr_saved);
+		$n = 0;
+
+		$node->psql('postgres', q(CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+                                  LANGUAGE plpgsql AS $$
+                                  BEGIN
+                                    EXECUTE 'SELECT txid_current()';
+                                    RETURN true;
+                                  END; $$;));
+
+		$node->psql('postgres', q(CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+                                  LANGUAGE plpgsql AS $$
+                                  BEGIN
+                                    RETURN MOD($1, 2) = 0;
+                                  END; $$;));
+		while (1)
+		{
+
+			if (int(rand(2)) == 0) {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+			} else {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+			}
+			is($result, '0', 'ALTER TABLE is correct');
+
+			if (1)
+			{
+				($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx;));
+				is($result, '0', 'REINDEX is correct');
+
+				if ($result) {
+					diag($stderr);
+					BAIL_OUT($stderr);
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx', heapallindexed => true, checkunique => true);));
+				is($result, '0', 'bt_index_check is correct');
+				if ($result)
+				{
+					diag($stderr);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#reindex:)' . $n++);
+				}
+			}
+
+			if (1)
+			{
+				my $variant = int(rand(7));
+				my $sql;
+				if ($variant == 0) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at););
+				} elsif ($variant == 1) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable(););
+				} elsif ($variant == 2) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;);
+				} elsif ($variant == 3) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i););
+				} elsif ($variant == 4) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i)););
+				} elsif ($variant == 5) {
+					$sql = q(CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i););
+				} elsif ($variant == 6) {
+					$sql = q(CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+				} else { diag("#wrong variant"); }
+
+				diag('#' . $sql);
+				($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+				is($result, '0', 'CREATE INDEX is correct');
+				$stderr_saved = $stderr;
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);));
+				is($result, '0', 'bt_index_check for new index is correct');
+				if ($result)
+				{
+					diag($stderr);
+					diag($stderr_saved);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#create:)' . $n++);
+				}
+
+				if (1)
+				{
+					($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+					is($result, '0', 'REINDEX 2 is correct');
+					if ($result) {
+						diag($stderr);
+						BAIL_OUT($stderr);
+					}
+
+					($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);));
+					is($result, '0', 'bt_index_check 2 is correct');
+					if ($result)
+					{
+						diag($stderr);
+						BAIL_OUT($stderr);
+					} else {
+						diag('#reindex2:)' . $n++);
+					}
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+				is($result, '0', 'DROP INDEX is correct');
+			}
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+		$psql{stdin} .= "\\q\n";
+		$psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+	$child->finalize();
+	$child->summary();
+
+	shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+	waitpid($pid,0);
+	done_testing();
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# For each query we run, we'll restart the timeout.  Otherwise the timeout
+	# would apply to the whole test script, and would need to be set very high
+	# to survive when running under Valgrind.
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			diag("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
diff --git a/src/bin/pg_amcheck/t/007_concurrently_unique.pl b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
new file mode 100644
index 00000000000..67e2be3e33f
--- /dev/null
+++ b/src/bin/pg_amcheck/t/007_concurrently_unique.pl
@@ -0,0 +1,239 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use threads;
+use Test::More;
+use Test::Builder;
+
+eval {
+	require IPC::SysV;
+	IPC::SysV->import(qw(IPC_CREAT IPC_EXCL S_IRUSR S_IWUSR));
+};
+
+if ($@ || $windows_os)
+{
+	plan skip_all => 'Fork and shared memory are not supported by this platform';
+}
+
+# TODO: refactor to https://metacpan.org/pod/IPC%3A%3AShareable
+my ($pid, $shmem_id, $shmem_key,  $shmem_size);
+eval 'sub IPC_CREAT {0001000}' unless defined &IPC_CREAT;
+$shmem_size = 4;
+$shmem_key = rand(1000000);
+$shmem_id = shmget($shmem_key, $shmem_size, &IPC_CREAT | 0777) or die "Can't shmget: $!";
+shmwrite($shmem_id, "wait", 0, $shmem_size) or die "Can't shmwrite: $!";
+
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+#
+# Test set-up
+#
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test_unique');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 128MB');
+$node->append_conf('postgresql.conf', 'shared_buffers = 256MB');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX idx ON tbl(i, updated_at)));
+
+my $builder = Test::More->builder;
+$builder->use_numbers(0);
+$builder->no_plan();
+
+my $child  = $builder->child("pg_bench");
+
+if(!defined($pid = fork())) {
+	# fork returned undef, so unsuccessful
+	die "Cannot fork a child: $!";
+} elsif ($pid == 0) {
+
+	# $node->psql('postgres', q(INSERT INTO tbl SELECT i,0,0,0,now() FROM generate_series(1, 1000) s(i);));
+	# while [ $? -eq 0 ]; do make -C src/bin/pg_amcheck/ check PROVE_TESTS='t/007_*' ; done
+
+	$node->pgbench(
+		'--no-vacuum --client=40 --exit-on-abort --transactions=1000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and RC',
+		{
+			# Ensure some HOT updates happen
+			'001_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*1000,0,0,0,now()) on conflict(i) do update set updated_at = date_trunc('seconds', now());
+			),
+			'002_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*100,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+			'003_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*10000,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+			'004_pgbench_concurrent_transaction_updates' => q(
+				INSERT INTO tbl VALUES(random()*100000,0,0,0,now()) on conflict(i)  do update set updated_at = date_trunc('seconds', now());
+			),
+		});
+
+	if ($child->is_passing()) {
+		shmwrite($shmem_id, "done", 0, $shmem_size) or die "Can't shmwrite: $!";
+	} else {
+		shmwrite($shmem_id, "fail", 0, $shmem_size) or die "Can't shmwrite: $!";
+	}
+
+	my $pg_bench_fork_flag;
+	while (1) {
+		shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+		usleep(100_000);	# builtin sleep() truncates 0.1 to 0 and would busy-loop
+		last if $pg_bench_fork_flag eq "stop";
+	}
+} else {
+	my $pg_bench_fork_flag;
+	shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+
+	subtest 'reindex run subtest' => sub {
+		is($pg_bench_fork_flag, "wait", "pg_bench_fork_flag is correct");
+
+		my %psql = (stdin => '', stdout => '', stderr => '');
+		$psql{run} = IPC::Run::start(
+			[ 'psql', '-XA', '-f', '-', '-d', $node->connstr('postgres') ],
+			'<',
+			\$psql{stdin},
+			'>',
+			\$psql{stdout},
+			'2>',
+			\$psql{stderr},
+			$psql_timeout);
+
+		my ($result, $stdout, $stderr, $n, $stderr_saved);
+
+#		ok(send_query_and_wait(\%psql, q[SELECT pg_sleep(10);], qr/^.*$/m), 'SELECT');
+
+		while (1)
+		{
+
+			if (int(rand(2)) == 0) {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=4);));
+			} else {
+				($result, $stdout, $stderr) = $node->psql('postgres', q(ALTER TABLE tbl SET (parallel_workers=0);));
+			}
+			is($result, '0', 'ALTER TABLE is correct');
+
+
+			if (1)
+			{
+				my $sql = q(select pg_sleep(0); CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i););
+
+				($result, $stdout, $stderr) = $node->psql('postgres', $sql);
+				is($result, '0', 'CREATE INDEX is correct');
+				$stderr_saved = $stderr;
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);));
+				is($result, '0', 'bt_index_check for new index is correct');
+				if ($result)
+				{
+					diag($stderr);
+					diag($stderr_saved);
+					BAIL_OUT($stderr);
+				} else {
+					diag('#create:)' . $n++);
+				}
+
+				if (1)
+				{
+					($result, $stdout, $stderr) = $node->psql('postgres', q(REINDEX INDEX CONCURRENTLY idx_2;));
+					is($result, '0', 'REINDEX 2 is correct');
+					if ($result) {
+						diag($stderr);
+						BAIL_OUT($stderr);
+					}
+
+					($result, $stdout, $stderr) = $node->psql('postgres', q(SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);));
+					is($result, '0', 'bt_index_check 2 is correct');
+					if ($result)
+					{
+						diag($stderr);
+						BAIL_OUT($stderr);
+					} else {
+						diag('#reindex2:)' . $n++);
+					}
+				}
+
+				($result, $stdout, $stderr) = $node->psql('postgres', q(DROP INDEX CONCURRENTLY idx_2;));
+				is($result, '0', 'DROP INDEX is correct');
+			}
+			shmread($shmem_id, $pg_bench_fork_flag, 0, $shmem_size) or die "Can't shmread: $!";
+			last if $pg_bench_fork_flag ne "wait";
+		}
+
+		# explicitly shut down psql instances gracefully
+		$psql{stdin} .= "\\q\n";
+		$psql{run}->finish;
+
+		is($pg_bench_fork_flag, "done", "pg_bench_fork_flag is correct");
+	};
+
+	$child->finalize();
+	$child->summary();
+
+	shmwrite($shmem_id, "stop", 0, $shmem_size) or die "Can't shmwrite: $!";
+	waitpid($pid,0);
+	done_testing();
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# For each query we run, we'll restart the timeout.  Otherwise the timeout
+	# would apply to the whole test script, and would need to be set very high
+	# to survive when running under Valgrind.
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+		if ($psql_timeout->is_expired)
+		{
+			diag("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			diag("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
-- 
2.43.0

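The commit message of the following patch argues that resetting the snapshot during a non-unique concurrent build lets the xmin horizon advance. A toy model in C may make the effect concrete. It is only a sketch under loud assumptions: every name is hypothetical, the build is the only old snapshot holder, XIDs are consumed at a fixed rate per scanned page, the reset interval is an arbitrary parameter, and XID wraparound is ignored. None of this is PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyXid;	/* toy transaction ID, no wraparound handling */

/*
 * Return the xmin that the build's snapshot still pins when the scan ends.
 * reset_every == 0 means the snapshot is never reset, which is how a unique
 * index build behaves in this model.
 */
ToyXid
toy_pinned_xmin_at_end(ToyXid start_xid, uint32_t pages,
					   uint32_t xids_per_page, uint32_t reset_every)
{
	ToyXid		next_xid = start_xid;	/* next XID to be assigned */
	ToyXid		pinned = start_xid;		/* xmin of the current snapshot */

	/* The toy model ignores wraparound, so reject inputs that overflow. */
	assert((uint64_t) start_xid + (uint64_t) pages * xids_per_page <= UINT32_MAX);

	for (uint32_t page = 1; page <= pages; page++)
	{
		/* Concurrent writers consume XIDs while this page is scanned. */
		next_xid += xids_per_page;

		/* A fresh snapshot sees no older transaction still running. */
		if (reset_every != 0 && page % reset_every == 0)
			pinned = next_xid;
	}
	return pinned;
}

/* How many XIDs behind the newest one the pinned xmin is at the end. */
uint32_t
toy_horizon_lag(ToyXid start_xid, uint32_t pages,
				uint32_t xids_per_page, uint32_t reset_every)
{
	ToyXid		end_xid = start_xid + pages * xids_per_page;

	return end_xid - toy_pinned_xmin_at_end(start_xid, pages,
											xids_per_page, reset_every);
}
```

In this model, a 10-page scan starting at XID 100 with 5 XIDs consumed per page leaves the pinned xmin at 100 without resets and at 140 when the snapshot is reset every 4 pages, so the lag VACUUM has to respect shrinks from 50 XIDs to 10.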


  [application/x-patch] v5-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (29.2K, 7-v5-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch)
  download | inline diff:
From d435fe63303485e68e197b3dc6e571065eb6863b Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v5 4/5] Allow snapshot resets during parallel concurrent index
 builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 43 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 +++--
 src/backend/access/nbtree/nbtsort.c           | 38 ++++++++++++--
 src/backend/access/table/tableam.c            | 37 ++++++++++++--
 src/backend/access/transam/parallel.c         | 50 +++++++++++++++++--
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 ++--
 .../expected/cic_reset_snapshots.out          | 23 ++++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 12 files changed, 178 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * The scan then indexes whatever's live according to that snapshot, which
+	 * is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically.  If the leader
+	 * is going to reset its own active snapshot as well, we need to wait
+	 * until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that; for a non-unique index the
+	 * snapshot is reset every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that; for a non-unique index, that snapshot may be
+	 * reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * If the leader is going to reset its own active snapshot as well, we
+	 * need to wait until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan accordingly
+	 * and use the active snapshot as the initial state; it is replaced as
+	 * the scan proceeds.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate space in dynamic shared memory for per-worker flags that
+		 * report whether each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Reset the snapshot-restored flags of all workers to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, a worker is treated as attached only once it has
+ * restored its snapshot. This is needed when using periodic snapshot resets to
+ * ensure all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3a7357a050d..148e1982cad 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -291,14 +291,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. Snapshots are then changed on the fly so that the xmin horizon
+ * can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 4cfbbb05923..49ef68d9071 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,27 +78,40 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 4fef5a47431..5d1c31493f0 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -79,4 +82,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
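
Reading aid, not part of the patch: the snapshot-restored handshake that the
patch above adds to parallel.c can be modeled in isolation. The sketch below
is a simplified single-process model under stated assumptions. `ToyContext`,
`ToyWorker`, `MAX_WORKERS` and the `toy_*` functions are hypothetical names,
and a plain array stands in for the DSM chunk. It only illustrates the counting
rule: with wait_for_snapshot set, a worker counts as attached only after its
flag has been set.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WORKERS 8			/* hypothetical limit for this sketch */

/* Toy stand-in for the per-worker state kept by the leader. */
typedef struct ToyWorker
{
	bool		queue_attached;		/* worker attached to its error queue */
	bool	   *snapshot_restored;	/* points into the shared flag array */
} ToyWorker;

typedef struct ToyContext
{
	int			nworkers;
	bool		flags[MAX_WORKERS];	/* stands in for the DSM chunk */
	ToyWorker	worker[MAX_WORKERS];
} ToyContext;

/* Leader side: point each worker at its own flag and clear it. */
static void
toy_init(ToyContext *cxt, int nworkers)
{
	assert(nworkers >= 0 && nworkers <= MAX_WORKERS);
	cxt->nworkers = nworkers;
	for (int i = 0; i < nworkers; i++)
	{
		/* Arithmetic on a bool pointer already scales by sizeof(bool). */
		cxt->worker[i].snapshot_restored = cxt->flags + i;
		*cxt->worker[i].snapshot_restored = false;
		cxt->worker[i].queue_attached = false;
	}
}

/* Worker side: the worker has attached to its error queue. */
static void
toy_worker_attach_queue(ToyContext *cxt, int i)
{
	assert(i >= 0 && i < cxt->nworkers);
	cxt->worker[i].queue_attached = true;
}

/* Worker side: the worker has installed its initial snapshot. */
static void
toy_worker_restore_snapshot(ToyContext *cxt, int i)
{
	assert(i >= 0 && i < cxt->nworkers);
	cxt->flags[i] = true;
}

/* Leader side: count the workers that qualify as attached. */
static int
toy_count_attached(const ToyContext *cxt, bool wait_for_snapshot)
{
	int			n = 0;

	for (int i = 0; i < cxt->nworkers; i++)
	{
		if (!cxt->worker[i].queue_attached)
			continue;
		if (!wait_for_snapshot || *cxt->worker[i].snapshot_restored)
			n++;
	}
	return n;
}
```

In this model the leader would keep polling until toy_count_attached()
reaches the number of launched workers. The real code additionally handles
worker start-up failures and latch waits, which are left out here.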



^ permalink  raw  reply  [nested|flat] 33+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-12-17 23:29  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-12-17 23:29 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

After the fix in [0], I simplified the stress tests to a single pgbench run
without any forks.

[0]: https://commitfest.postgresql.org/51/5439/
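
Reading aid, not part of the patch: the attached v6-0005 patch chooses between
three snapshot strategies in the B-tree parallel build path. A minimal sketch
of that decision, assuming only the two inputs visible in the diff
(isconcurrent and the uniqueness of the index), follows. The enum and the
`toy_*` function names are hypothetical and exist only for illustration. Note
that the BRIN path in the same patch has no uniqueness check and resets
snapshots for every concurrent build.

```c
#include <stdbool.h>

/* Hypothetical labels for the three cases in _bt_begin_parallel(). */
typedef enum ToySnapshotMode
{
	TOY_SNAPSHOT_ANY,			/* normal build: scan with SnapshotAny */
	TOY_SNAPSHOT_RESET,			/* concurrent, non-unique: periodic resets */
	TOY_SNAPSHOT_REGISTERED		/* concurrent, unique: one MVCC snapshot */
} ToySnapshotMode;

/* Pick the strategy from the build type, mirroring the patched branches. */
static ToySnapshotMode
toy_choose_snapshot_mode(bool isconcurrent, bool isunique)
{
	if (!isconcurrent)
		return TOY_SNAPSHOT_ANY;
	if (!isunique)
		return TOY_SNAPSHOT_RESET;
	return TOY_SNAPSHOT_REGISTERED;
}

/*
 * The leader waits for workers to restore their snapshots only when
 * snapshots are reset and the leader itself takes part in the scan.
 */
static bool
toy_wait_for_snapshot_attach(ToySnapshotMode mode, bool leaderparticipates)
{
	return mode == TOY_SNAPSHOT_RESET && leaderparticipates;
}
```

The unique case keeps a single registered snapshot because uniqueness
enforcement needs a consistent view of the whole scan, as the commit message
below states.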



Attachments:

  [application/octet-stream] v6-0005-Allow-snapshot-resets-during-parallel-concurrent-.patch (29.2K, 3-v6-0005-Allow-snapshot-resets-during-parallel-concurrent-.patch)
  download | inline diff:
From 15d61bbb64e5f8e418594d1ea6b50ceb9c65d9d1 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v6 5/6] Allow snapshot resets during parallel concurrent index
 builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 43 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 +++--
 src/backend/access/nbtree/nbtsort.c           | 38 ++++++++++++--
 src/backend/access/table/tableam.c            | 37 ++++++++++++--
 src/backend/access/transam/parallel.c         | 50 +++++++++++++++++--
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 ++--
 .../expected/cic_reset_snapshots.out          | 23 ++++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 12 files changed, 178 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * If the leader is going to reset its own active snapshot as well, we
+	 * need to wait until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that; for a non-unique index,
+	 * that snapshot is reset every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that; for a non-unique index, that snapshot may be
+	 * reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * If the leader is going to reset its own active snapshot as well, we
+	 * need to wait until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly and use the current active snapshot as the
+	 * initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report
+		 * whether each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2189bf0d9ae..b3cc7a2c150 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -287,14 +287,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. That causes snapshots to be changed on the fly, which allows
+ * the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 4cfbbb05923..49ef68d9071 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,27 +78,40 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 4fef5a47431..5d1c31493f0 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -79,4 +82,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
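
The leader/worker handshake added to parallel.c in the patch above (one flag
per worker in shared memory, set by the worker once its snapshot is restored
and checked by the leader when wait_for_snapshot is true) can be sketched
outside PostgreSQL roughly as below. This is only an illustration under
simplifying assumptions: it is single-threaded, it models only the snapshot
condition and not the error-queue attach check, and every Toy*/toy_* name is
hypothetical rather than part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_WORKERS 8

/* Hypothetical shared area: one "snapshot restored" flag per worker. */
typedef struct ToyShared
{
	int			nworkers;
	bool		snapshot_restored[TOY_MAX_WORKERS];
} ToyShared;

/* Leader side: clear all flags, as the DSM (re)initialization does. */
void
toy_shared_init(ToyShared *shared, int nworkers)
{
	assert(nworkers >= 0 && nworkers <= TOY_MAX_WORKERS);
	shared->nworkers = nworkers;
	for (int i = 0; i < nworkers; i++)
		shared->snapshot_restored[i] = false;
}

/* Worker side: report that this worker has restored its snapshot. */
void
toy_worker_snapshot_restored(ToyShared *shared, int worker_number)
{
	assert(worker_number >= 0 && worker_number < shared->nworkers);
	shared->snapshot_restored[worker_number] = true;
}

/*
 * Leader side: may the leader go on (and start resetting its own snapshot)?
 * Without wait_for_snapshot the flags are ignored entirely.
 */
bool
toy_leader_may_proceed(const ToyShared *shared, bool wait_for_snapshot)
{
	if (!wait_for_snapshot)
		return true;
	for (int i = 0; i < shared->nworkers; i++)
	{
		if (!shared->snapshot_restored[i])
			return false;
	}
	return true;
}
```

In the patch itself the leader polls inside WaitForParallelWorkersToAttach()
rather than calling a predicate once; the sketch only shows which condition
is being waited for.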



  [application/octet-stream] v6-0004-Allow-advancing-xmin-during-non-unique-non-parall.patch (35.8K, 4-v6-0004-Allow-advancing-xmin-during-non-unique-non-parall.patch)
  download | inline diff:
From e85b568a1a8d39ab24bd21bef90d546fce61a726 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v6 4/6] Allow advancing xmin during non-unique, non-parallel 
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
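
As a rough illustration of the reset cadence described above, the following
standalone sketch counts how often a scan over a given number of blocks
would reset its snapshot. It assumes, as the heapam.c hunk in this patch
does, that the check is "block number modulo SO_RESET_SNAPSHOT_EACH_N_PAGE
equals zero" with the constant set to 64; the ToyScan type and the toy_*
functions are hypothetical stand-ins, not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as in the patch's heapam.c hunk. */
#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64

/* Hypothetical stand-in for a scan descriptor; not a PostgreSQL type. */
typedef struct ToyScan
{
	bool		reset_snapshot;			/* models the SO_RESET_SNAPSHOT flag */
	uint32_t	snapshot_generation;	/* bumped on every reset */
} ToyScan;

/* Hypothetical stand-in for heap_reset_scan_snapshot(). */
static void
toy_reset_snapshot(ToyScan *scan)
{
	scan->snapshot_generation++;
}

/* Called once per fetched block, like the hook in heap_fetch_next_buffer(). */
static void
toy_fetch_block(ToyScan *scan, uint32_t blkno)
{
	if (scan->reset_snapshot &&
		blkno % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
		toy_reset_snapshot(scan);
}

/* Scan blocks [0, nblocks) in order and return the number of resets. */
uint32_t
toy_scan_count_resets(uint32_t nblocks, bool reset_snapshot)
{
	ToyScan		scan = {reset_snapshot, 0};

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
		toy_fetch_block(&scan, blkno);
	return scan.snapshot_generation;
}
```

With this rule a 200-block scan resets at blocks 0, 64, 128 and 192; note
that block 0 itself satisfies the modulo test, so the first reset happens
right at the start of the scan.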

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  14 +++
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  14 +++
 src/backend/catalog/index.c                   |  30 +++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 102 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  82 ++++++++++++++
 15 files changed, 375 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * For unique index we need consistent snapshot for the whole scan.
+	 * In case of parallel scan some additional infrastructure required
+	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just for the case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed, because the snapshot is going to be replaced
+			 * every so often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, since it may be a copy */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);	/* do not reset snapshots */
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 05dc6add7eb..e0ada5ce159 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);	/* do not reset snapshots */
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One reason for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often.  The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible to a single snapshot
+	 * or to periodically refreshed ones.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f3856c519f6..5c7514c96ac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6779,6 +6780,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6834,6 +6836,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6891,6 +6898,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a fresh snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan.  That causes snapshots to be
+ * changed on the fly, allowing the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..4cfbbb05923
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,102 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..4fef5a47431
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,82 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
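
For readers skimming the patch above: the effect of SO_RESET_SNAPSHOT can be
illustrated with a toy model. This is only a sketch under stated assumptions,
not code from the patch. Every name below (ToySnapshot, toy_next_xid,
toy_take_snapshot, toy_scan, RESET_EVERY_N_PAGES) is hypothetical and none of
it is a PostgreSQL API. The model assumes one xid is consumed per scanned page
and an arbitrary reset interval of 4 pages.

```c
#include <stdint.h>

/* Hypothetical reset interval; the patch refers to SO_RESET_SNAPSHOT_EACH_N_PAGE. */
#define RESET_EVERY_N_PAGES 4

/* Toy stand-in for a snapshot: only the xmin it holds back matters here. */
typedef struct ToySnapshot
{
	uint32_t	xmin;
} ToySnapshot;

/* Toy xid counter, advanced to simulate concurrent transactions. */
static uint32_t toy_next_xid = 100;

/* Take a "fresh" snapshot whose xmin is the current toy xid counter. */
static ToySnapshot
toy_take_snapshot(void)
{
	ToySnapshot s = {toy_next_xid};

	return s;
}

/*
 * Scan npages pages.  Returns the xmin still held by the scan when it
 * finishes.  With reset_snapshots set, the held snapshot is replaced every
 * RESET_EVERY_N_PAGES pages, so the held xmin follows the xid counter.
 */
static uint32_t
toy_scan(int npages, int reset_snapshots)
{
	ToySnapshot active = toy_take_snapshot();

	for (int page = 1; page <= npages; page++)
	{
		toy_next_xid++;		/* assumed: one xid consumed per page */

		if (reset_snapshots && page % RESET_EVERY_N_PAGES == 0)
			active = toy_take_snapshot();	/* models "pop old, push fresh" */
	}

	return active.xmin;
}
```

Starting from toy_next_xid = 100, toy_scan(10, 0) ends still holding xmin 100,
while toy_scan(10, 1) ends holding xmin 108 (the last reset happens at page 8).
In the model, that gap is what a long single-snapshot build would keep VACUUM
from cleaning up.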



  [application/octet-stream] v6-0002-this-is-https-commitfest.postgresql.org-50-5160-m.patch (61.5K, 5-v6-0002-this-is-https-commitfest.postgresql.org-50-5160-m.patch)
  download | inline diff:
From 12efb82206cee7843bf17ccabacc91435d0bac5a Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v6 2/6] this is https://commitfest.postgresql.org/50/5160/
 merged in single commit. it is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 ++++++++-
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++---
 src/backend/utils/time/snapmgr.c              |   2 +
 src/test/modules/injection_points/Makefile    |   7 +-
 .../expected/index_concurrently_upsert.out    |  80 ++++++
 .../index_concurrently_upsert_predicate.out   |  80 ++++++
 .../expected/reindex_concurrently_upsert.out  | 238 ++++++++++++++++++
 ...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
 ...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |  11 +
 .../specs/index_concurrently_upsert.spec      |  68 +++++
 .../index_concurrently_upsert_predicate.spec  |  70 ++++++
 .../specs/reindex_concurrently_upsert.spec    |  86 +++++++
 ...dex_concurrently_upsert_on_constraint.spec |  86 +++++++
 ...index_concurrently_upsert_partitioned.spec |  88 +++++++
 18 files changed, 1505 insertions(+), 50 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 153390f2dc9..56b58d1ed74 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * Indexes are opened in list order, so the lock order is unchanged and cannot deadlock.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare the requirements that other indexes must meet to be used as
+					 * arbiters together with indexOidFromConstraint. Both equivalent indexes
+					 * must be involved in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider indisready indexes as well as indisvalid ones,
+		 * because they may become indisvalid before the execution phase. This
+		 * keeps the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* For ON CONFLICT ON CONSTRAINT ... DO UPDATE, skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a1a0c2adeb6..2189bf0d9ae 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -392,6 +393,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+			reindex_concurrently_upsert \
+			index_concurrently_upsert \
+			reindex_concurrently_upsert_partitioned \
+			reindex_concurrently_upsert_on_constraint \
+			index_concurrently_upsert_predicate
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'reindex_concurrently_upsert',
+      'index_concurrently_upsert',
+      'reindex_concurrently_upsert_partitioned',
+      'reindex_concurrently_upsert_on_constraint',
+      'index_concurrently_upsert_predicate',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
+    # We wait for all snapshots, so avoid parallel test execution
+    'runningcheck-parallel': false,
   },
   'tap': {
     'env': {
@@ -53,5 +62,7 @@ tests += {
     'tests': [
       't/001_stats.pl',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
   },
 }
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+	CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+	CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+		FOR VALUES FROM (0) TO (10000)
+		WITH (parallel_workers = 0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v6-0003-Add-stress-tests-for-concurrent-index-operations.patch (6.5K, 6-v6-0003-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 212a59c454c7584f1b020e9b847da5bd86e22f56 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v6 3/6] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
 2 files changed, 145 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..002348b8366
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					--\sleep 200 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					--\sleep 200 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					--\sleep 200 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					--\sleep 200 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v6-0006-Allow-snapshot-resets-in-concurrent-unique-index-.patch (32.5K, 7-v6-0006-Allow-snapshot-resets-in-concurrent-unique-index-.patch)
  download | inline diff:
From dc8447015383a3c38c71570749b697b25c7aceb7 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v6 6/6] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple live tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
---
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 173 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   6 +-
 src/backend/utils/sort/tuplesortvariants.c    |  67 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 11 files changed, 242 insertions(+), 75 deletions(-)
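
Note (illustration only, not part of the patch): the two-step rule from the commit message, "ignore dead tuples, then reject multiple live tuples with the same key", can be sketched in standalone C roughly as below. The `SketchEntry` type and `sketch_check_unique` function are hypothetical names invented for this note; they are not PostgreSQL types or functions, and the real logic in tuplesort and `_bt_load` works on index tuples and heap visibility rather than a plain flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one sorted index entry. */
typedef struct SketchEntry
{
	int			key;		/* indexed key value */
	bool		dead;		/* true if the heap tuple is known to be dead */
} SketchEntry;

/*
 * Walk entries sorted by key and check that no key has more than one live
 * entry.  Dead entries never count toward a violation.
 *
 * Returns true if the live entries are unique.  Otherwise returns false and,
 * if dup_key is not NULL, stores the offending key there.
 */
bool
sketch_check_unique(const SketchEntry *entries, size_t n, int *dup_key)
{
	bool		have_live = false;	/* live entry seen for current key? */
	int			cur_key = 0;

	for (size_t i = 0; i < n; i++)
	{
		/* The sketch assumes the input is already sorted by key. */
		assert(i == 0 || entries[i - 1].key <= entries[i].key);

		/* A new key starts a new group of equal entries. */
		if (i == 0 || entries[i].key != cur_key)
		{
			cur_key = entries[i].key;
			have_live = false;
		}

		/* Step 1: dead entries are ignored for uniqueness purposes. */
		if (entries[i].dead)
			continue;

		/* Step 2: a second live entry with the same key is a violation. */
		if (have_live)
		{
			if (dup_key != NULL)
				*dup_key = cur_key;
			return false;
		}
		have_live = true;
	}
	return true;
}
```

With this rule, a key that appears several times in the sorted stream is acceptable as long as at most one of those entries is live, which is the situation that can arise once the build no longer holds a single snapshot for the whole scan.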

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e5163609c1..921b806642a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2acbf121745..ac9e5acfc53 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of a concurrent build.
+	 * This is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool is free of multiple
+	 * alive tuples with the same key. This can happen when alive tuples are
+	 * separated by dead ones in sort order, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if it has not been tested yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the one in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	bool		wait_for_snapshot_attach;
 	int			querylen;
 
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case when leader going to reset own active snapshot as well - we need to
 	 * wait until all workers imported initial snapshot.
 	 */
-	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
 
 	if (wait_for_snapshot_attach)
 		WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb45..3d6dda4ace8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4672,7 +4670,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4790,17 +4788,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so that it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the opclass or collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls, if provided, is set to true when either tuple has a NULL attribute.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4826,6 +4831,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4845,7 +4852,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4856,7 +4863,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during a
+ * concurrent unique index build.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4865,6 +4873,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls, if provided, is set to true when either tuple has a NULL attribute.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4873,7 +4883,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4890,6 +4901,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e0ada5ce159..f6a1a2f3f90 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3292,9 +3292,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the one in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9ee5ea15fd4..ec3769585c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1803,9 +1803,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * This changes snapshots on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 49ef68d9071..c8e4683ad6d 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
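
(Illustration only, not part of the patch.) The uniqueness check that the patch above adds to _bt_load can be sketched outside PostgreSQL as a single pass over key-sorted entries. This is a minimal sketch under stated assumptions: `SortedEntry` and `has_alive_duplicate` are hypothetical names, liveness is passed in as a precomputed flag (the patch instead probes the heap lazily with table_index_fetch_tuple and SnapshotSelf, and only when two neighbouring keys compare equal), and NULL handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sorted index tuple; not a PostgreSQL type. */
typedef struct SortedEntry
{
	int		key;	/* indexed value */
	bool	alive;	/* assumed result of a liveness probe of the heap tuple */
} SortedEntry;

/*
 * Return true if some run of equal keys contains more than one alive entry.
 * The entries must already be sorted by key, as tuplesort output would be.
 */
bool
has_alive_duplicate(const SortedEntry *entries, size_t n)
{
	bool	seen_alive = false;

	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		/* A new group of equal keys starts whenever the key changes. */
		if (i == 0 || entries[i].key != entries[i - 1].key)
			seen_alive = false;

		if (entries[i].alive)
		{
			/* Second alive entry in the same group: a real violation. */
			if (seen_alive)
				return true;
			seen_alive = true;
		}
	}
	return false;
}
```

For the "addda" pattern from the patch comment (alive, dead, dead, dead, alive under one key), the function returns true even though no two alive entries are adjacent. That is the case a pairwise check in the sort comparator can miss, which is why the patch repeats the check while loading.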




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-12-21 18:00  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-12-21 18:00 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

Added STIR access method, next step is validating indexes using it.

Best regards,
Mikhail.



Attachments:

  [application/octet-stream] v7-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (61.5K, 3-v7-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch)
  download | inline diff:
From 12efb82206cee7843bf17ccabacc91435d0bac5a Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v7 1/6] this is https://commitfest.postgresql.org/50/5160/
 merged in single commit. it is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 ++++++++-
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++---
 src/backend/utils/time/snapmgr.c              |   2 +
 src/test/modules/injection_points/Makefile    |   7 +-
 .../expected/index_concurrently_upsert.out    |  80 ++++++
 .../index_concurrently_upsert_predicate.out   |  80 ++++++
 .../expected/reindex_concurrently_upsert.out  | 238 ++++++++++++++++++
 ...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
 ...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |  11 +
 .../specs/index_concurrently_upsert.spec      |  68 +++++
 .../index_concurrently_upsert_predicate.spec  |  70 ++++++
 .../specs/reindex_concurrently_upsert.spec    |  86 +++++++
 ...dex_concurrently_upsert_on_constraint.spec |  86 +++++++
 ...index_concurrently_upsert_partitioned.spec |  88 +++++++
 18 files changed, 1505 insertions(+), 50 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 153390f2dc9..56b58d1ed74 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a1a0c2adeb6..2189bf0d9ae 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -392,6 +393,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+			reindex_concurrently_upsert \
+			index_concurrently_upsert \
+			reindex_concurrently_upsert_partitioned \
+			reindex_concurrently_upsert_on_constraint \
+			index_concurrently_upsert_predicate
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'reindex_concurrently_upsert',
+      'index_concurrently_upsert',
+      'reindex_concurrently_upsert_partitioned',
+      'reindex_concurrently_upsert_on_constraint',
+      'index_concurrently_upsert_predicate',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
+    # These tests wait for all snapshots, so avoid parallel test execution
+    'runningcheck-parallel': false,
   },
   'tap': {
     'env': {
@@ -53,5 +62,7 @@ tests += {
     'tests': [
       't/001_stats.pl',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
   },
 }
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+	CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY of the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY of the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY of the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+	CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+		FOR VALUES FROM (0) TO (10000)
+		WITH (parallel_workers = 0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v7-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (36.3K, 4-v7-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch)
  download | inline diff:
From 452ef7089db779a08421a1084584c13c599d1320 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v7 3/6] Allow advancing xmin during non-unique, non-parallel 
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.
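
As a rough illustration of why the resets matter (a toy model, not code from this patch): assume one transaction ID is consumed per scanned page, and that the scan's snapshot pins the horizon at the ID that was current when the snapshot was taken. The function name `toy_max_horizon_lag` and the starting value are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model only: returns the largest gap between the newest transaction ID
 * and the ID pinned by the scan's snapshot.
 * reset_every == 0 means a single snapshot is held for the whole scan.
 */
static uint32_t
toy_max_horizon_lag(uint32_t npages, uint32_t reset_every)
{
	uint32_t	next_xid = 100;		/* arbitrary starting point */
	uint32_t	pinned = next_xid;	/* xmin of the current snapshot */
	uint32_t	max_lag = 0;

	for (uint32_t page = 0; page < npages; page++)
	{
		/* Take a fresh snapshot at the start of every reset_every-th page. */
		if (reset_every != 0 && page % reset_every == 0)
			pinned = next_xid;

		/* Concurrent activity while this page is being scanned. */
		next_xid++;

		assert(next_xid >= pinned);
		if (next_xid - pinned > max_lag)
			max_lag = next_xid - pinned;
	}
	return max_lag;
}
```

In this model a 1000-page scan with a single snapshot ends with a lag of 1000 IDs, while resetting every 64 pages keeps the lag at 64 or less, regardless of table size.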

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
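
A minimal standalone sketch of the decision logic described above, assuming a forward scan that starts at block 0. All `toy_`/`TOY_` names are hypothetical stand-ins; the real check lives in heap_fetch_next_buffer and the eligibility test in heapam_index_build_range_scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's scan flag and reset interval. */
#define TOY_SO_RESET_SNAPSHOT	(1u << 11)
#define TOY_RESET_EACH_N_PAGE	64u

/* Eligibility as described: concurrent, non-unique, serial, not a catalog. */
static bool
toy_use_reset_snapshots(bool concurrent, bool unique,
						bool parallel, bool system_catalog)
{
	return concurrent && !unique && !parallel && !system_catalog;
}

/* Would the scan take a fresh snapshot when it reaches block blkno? */
static bool
toy_should_reset(uint32_t flags, uint32_t blkno)
{
	return (flags & TOY_SO_RESET_SNAPSHOT) != 0 &&
		blkno % TOY_RESET_EACH_N_PAGE == 0;
}

/* Count the resets over a forward scan of nblocks blocks from block 0. */
static uint32_t
toy_count_resets(uint32_t flags, uint32_t nblocks)
{
	uint32_t	resets = 0;

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
		if (toy_should_reset(flags, blkno))
			resets++;

	assert(resets <= nblocks);
	return resets;
}
```

Because the check is a modulo on the block number, block 0 also triggers a reset, so even a one-page table sees one reset when the flag is set. That matches the regression test, where a 200-row table still reports the heap_reset_scan_snapshot_effective notice.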

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  14 +++
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  14 +++
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 107 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 15 files changed, 384 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+	/* In some cases that is still impossible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs one consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, since the push may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 05dc6add7eb..e0ada5ce159 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often.
+ * The reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible to a single snapshot
+	 * or to several refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f3856c519f6..5c7514c96ac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6779,6 +6780,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6834,6 +6836,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6891,6 +6898,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots then change on the
+ * fly, which allows the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..5db54530f17
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,107 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v7-0002-Add-stress-tests-for-concurrent-index-operations.patch (6.5K, 5-v7-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From b4f22a1da4bbbff6a268c0f62196a264cb126896 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v7 2/6] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
 2 files changed, 145 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..142e8fb845e
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE UNIQUE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v7-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (29.5K, 6-v7-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch)
  download | inline diff:
From 1a2a8cc969011974913c22604d608a0d9c4ffa78 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v7 4/6] Allow snapshot resets during parallel concurrent index
 builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with scan
- Add regression tests to verify behavior with various index types
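
A minimal sketch of the attach-wait idea from the list above, not the patch's
actual code. WorkerSlot, init_snapshot_flags and poll_attached are hypothetical
names used only for illustration; the real logic lives in
InitializeParallelDSM and WaitForParallelWorkersToAttach. The point is that,
when the leader waits for snapshots, a worker is counted as attached only after
it has both attached to its error queue and set its "snapshot restored" flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-worker state the leader inspects. */
typedef struct WorkerSlot
{
	bool		sender_attached;	/* worker attached to its error queue */
	bool	   *snapshot_restored;	/* points into the shared flag array */
	bool		known_attached;		/* leader has already counted this worker */
} WorkerSlot;

/* Point each slot at its own element of the flag array and clear it. */
static void
init_snapshot_flags(WorkerSlot *slots, bool *flags, size_t nworkers)
{
	assert(slots != NULL && flags != NULL);

	for (size_t i = 0; i < nworkers; i++)
	{
		/* Pointer arithmetic on bool * already advances by sizeof(bool). */
		slots[i].snapshot_restored = flags + i;
		*slots[i].snapshot_restored = false;
		slots[i].sender_attached = false;
		slots[i].known_attached = false;
	}
}

/*
 * One polling pass of the leader.  Returns the number of workers known to be
 * attached after this pass.  With wait_for_snapshot set, a worker is counted
 * only once its snapshot flag is set as well.
 */
static size_t
poll_attached(WorkerSlot *slots, size_t nworkers, bool wait_for_snapshot)
{
	size_t		nknown = 0;

	for (size_t i = 0; i < nworkers; i++)
	{
		if (!slots[i].known_attached && slots[i].sender_attached &&
			(!wait_for_snapshot || *slots[i].snapshot_restored))
			slots[i].known_attached = true;

		if (slots[i].known_attached)
			nknown++;
	}
	return nknown;
}
```

In this toy model the leader would call poll_attached in a loop until the
result equals the number of launched workers; the sketch omits latches,
worker failure handling and memory-ordering concerns.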

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.
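
The rule in the paragraph above can be sketched as a three-way choice. This is
a simplified illustration under stated assumptions: ScanSnapshotMode and
choose_scan_snapshot_mode are hypothetical names, and any further conditions
the real code applies are left out.

```c
#include <assert.h>		/* for the checks mentioned in the note below */
#include <stdbool.h>

/* Hypothetical labels for how the heap scan obtains its snapshot. */
typedef enum ScanSnapshotMode
{
	SCAN_SNAPSHOT_ANY,			/* normal build: see every tuple */
	SCAN_SNAPSHOT_FIXED_MVCC,	/* one MVCC snapshot for the whole scan */
	SCAN_SNAPSHOT_RESETTING		/* MVCC snapshot replaced periodically */
} ScanSnapshotMode;

/* Pick the snapshot mode for an index build scan. */
static ScanSnapshotMode
choose_scan_snapshot_mode(bool isconcurrent, bool isunique)
{
	if (!isconcurrent)
		return SCAN_SNAPSHOT_ANY;
	if (isunique)
		return SCAN_SNAPSHOT_FIXED_MVCC;	/* uniqueness needs a stable view */
	return SCAN_SNAPSHOT_RESETTING;
}
```

For instance, choose_scan_snapshot_mode(true, false) yields
SCAN_SNAPSHOT_RESETTING, while a concurrent unique build keeps
SCAN_SNAPSHOT_FIXED_MVCC.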

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
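
To illustrate why a long-lived snapshot matters, here is a toy model of the
xmin horizon, assuming the horizon is simply the minimum xmin over all live
snapshots. ToyXid and toy_xmin_horizon are hypothetical; the real computation
is more involved and handles transaction ID wraparound, which this sketch
ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;		/* toy transaction id, no wraparound */

/* Return the oldest xmin among the given snapshots (n must be > 0). */
static ToyXid
toy_xmin_horizon(const ToyXid *snapshot_xmins, size_t n)
{
	ToyXid		horizon;

	assert(n > 0);

	horizon = snapshot_xmins[0];
	for (size_t i = 1; i < n; i++)
	{
		if (snapshot_xmins[i] < horizon)
			horizon = snapshot_xmins[i];
	}
	return horizon;
}
```

If the index build holds xmin 100 while other sessions are at 950 and 980, the
horizon in this model stays at 100. Once the build resets to a snapshot with
xmin 990, the horizon moves to 950, so tuples that died before that point
become eligible for cleanup.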
---
 src/backend/access/brin/brin.c                | 43 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 +++--
 src/backend/access/nbtree/nbtsort.c           | 38 ++++++++++++--
 src/backend/access/table/tableam.c            | 37 ++++++++++++--
 src/backend/access/transam/parallel.c         | 50 +++++++++++++++++--
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 ++--
 .../expected/cic_reset_snapshots.out          | 23 ++++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 12 files changed, 178 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * The scan then indexes whatever's live according to the active snapshot,
+	 * which is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically.  If the leader
+	 * is going to reset its own active snapshot as well, we must first wait
+	 * until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that.  For a non-unique index,
+	 * that snapshot is reset every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically.  If the leader
+	 * is going to reset its own active snapshot as well, we must first wait
+	 * until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan accordingly
+	 * (SO_RESET_SNAPSHOT) and use the current active snapshot as the initial
+	 * state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate a per-worker array of flags in dynamic shared memory; each
+		 * worker sets its flag once it has restored the leader's snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * If wait_for_snapshot is true, a worker counts as attached only once it has
+ * also restored its snapshot.  This is needed when using periodic snapshot
+ * resets, to ensure all workers have a valid initial snapshot before the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2189bf0d9ae..b3cc7a2c150 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -287,14 +287,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. This causes the snapshot to be replaced on the fly, which
+ * allows the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 5db54530f17..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,24 +78,35 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -97,7 +114,9 @@ REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v7-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch (32.5K, 7-v7-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch)
  download | inline diff:
From f48e59a4b33a4b05e2f08dedadfce8628a8ae094 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v7 5/6] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
---
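Note for readers: the second check above can be illustrated with a minimal,
standalone sketch. It is not the code from this patch. `SortedEntry` and
`has_live_duplicate` are hypothetical names used only for illustration, and
liveness is assumed to be precomputed, whereas the patch fetches heap tuples
lazily and only once equal keys are found. The sketch walks a key-sorted
stream and reports a violation when one group of equal keys holds more than
one live entry, which covers the "addda" pattern (a = alive, d = dead).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple coming out of the sort. */
typedef struct SortedEntry
{
	int			key;		/* index key value */
	bool		live;		/* outcome of the heap visibility check */
} SortedEntry;

/*
 * Return true if any run of equal keys contains more than one live entry.
 * The entries array must already be sorted by key.
 */
static bool
has_live_duplicate(const SortedEntry *entries, size_t n)
{
	bool		seen_live = false;

	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		/* A different key starts a new group, so forget earlier live entries. */
		if (i == 0 || entries[i].key != entries[i - 1].key)
			seen_live = false;

		if (entries[i].live)
		{
			/* A second live entry in the same group is a uniqueness violation. */
			if (seen_live)
				return true;
			seen_live = true;
		}
	}

	return false;
}
```

Dead entries between two live ones do not reset the group state, which is why
skipping dead tuples in the sort comparator alone is not sufficient.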
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 173 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   6 +-
 src/backend/utils/sort/tuplesortvariants.c    |  67 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 11 files changed, 242 insertions(+), 75 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e5163609c1..921b806642a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2acbf121745..ac9e5acfc53 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+    /*
+     * We need to ignore dead tuples for unique checks in case of concurrent build.
+     * It is required because of the periodic reset of the snapshot.
+     */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool contains no group
+	 * of multiple alive tuples with the same key. This can happen when alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet checked */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	bool		wait_for_snapshot_attach;
 	int			querylen;
 
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case when leader going to reset own active snapshot as well - we need to
 	 * wait until all workers imported initial snapshot.
 	 */
-	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
 
 	if (wait_for_snapshot_attach)
 		WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb45..3d6dda4ace8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4672,7 +4670,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4790,17 +4788,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when a null column is
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4826,6 +4831,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4845,7 +4852,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4856,7 +4863,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. Also, it may be used as a comparison function during concurrent
+ * build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4865,6 +4873,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, it is set to true when any compared column is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4873,7 +4883,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4890,6 +4901,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 		att = TupleDescAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e0ada5ce159..f6a1a2f3f90 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3292,9 +3292,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check, see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9ee5ea15fd4..ec3769585c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1803,9 +1803,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of a concurrent index build, SO_RESET_SNAPSHOT is applied to the
+ * scan, so snapshots change on the fly and the xmin horizon can propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/octet-stream] v7-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch (37.3K, 8-v7-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch)
  download | inline diff:
From ccad95c4c080d0a73d7e5c1458fde825b559f9fe Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v7 6/6] Add STIR (Short-Term Index Replacement) access method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
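
As a rough illustration of the storage model described above (an append-only
collection of TIDs guarded by a "skip inserts" flag), here is a minimal,
self-contained sketch. It is not the code from `stir.c`: every `Sketch*` name,
the tiny page capacity, and the in-memory array standing in for buffers are
assumptions made for the example, and locking and WAL logging are left out
entirely. Unlike `stirinsert`, which always returns false, the sketch returns
whether the TID was stored so the behavior is easy to observe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; these are not the structs from stir.h. */
#define SKETCH_TIDS_PER_PAGE 8	/* tiny capacity so pages fill up quickly */
#define SKETCH_MAX_PAGES     8	/* block 0 plays the role of the metapage */

typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

typedef struct SketchPage
{
	uint16_t	maxoff;			/* number of TIDs stored on this page */
	SketchTid	tids[SKETCH_TIDS_PER_PAGE];
} SketchPage;

typedef struct SketchIndex
{
	bool		skipInserts;	/* mirrors the metapage flag */
	uint32_t	lastBlkNo;		/* 0 means "no data page yet" */
	SketchPage	pages[SKETCH_MAX_PAGES];
} SketchIndex;

void
sketch_init(SketchIndex *idx)
{
	memset(idx, 0, sizeof(*idx));
}

/* Append a TID to the end of a page; returns false if it does not fit. */
bool
sketch_page_add(SketchPage *page, SketchTid tid)
{
	if (page->maxoff >= SKETCH_TIDS_PER_PAGE)
		return false;
	page->tids[page->maxoff++] = tid;
	return true;
}

/* Returns true if the TID was stored, false if it was dropped. */
bool
sketch_insert(SketchIndex *idx, SketchTid tid)
{
	bool		added;

	/* Once the flag is set, inserts are silently ignored. */
	if (idx->skipInserts)
		return false;

	/* Fast path: try the last data page, if there is one. */
	if (idx->lastBlkNo > 0 &&
		sketch_page_add(&idx->pages[idx->lastBlkNo], tid))
		return true;

	/* No data page yet, or it is full: "extend" by one page. */
	if (idx->lastBlkNo + 1 >= SKETCH_MAX_PAGES)
		return false;			/* the sketch has a fixed capacity */
	idx->lastBlkNo++;
	added = sketch_page_add(&idx->pages[idx->lastBlkNo], tid);
	assert(added);				/* a fresh page always has room */
	(void) added;
	return true;
}

/* Count stored TIDs by walking the data pages in block order. */
uint32_t
sketch_count(const SketchIndex *idx)
{
	uint32_t	n = 0;

	for (uint32_t blk = 1; blk <= idx->lastBlkNo; blk++)
		n += idx->pages[blk].maxoff;
	return n;
}
```

Tuples are only ever appended, so the sketch needs nothing beyond the last
block number to find the insertion point. Setting `skipInserts` turns every
later insert into a no-op.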
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 576 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   3 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 779 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..bec79b48cb2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 62a371db7f7..63ee0ef134d 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	/* Initialize contents of meta page */
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+	GenericXLogFinish(state);
+
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	GenericXLogState *state;
+	BlockNumber blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			state = GenericXLogStart(index);
+			page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				GenericXLogFinish(state);
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			/* Didn't fit, must try other pages */
+			GenericXLogAbort(state);
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		state = GenericXLogStart(index);
+		metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again
+			 */
+			GenericXLogAbort(state);
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+
+			page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+			GenericXLogFinish(state);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can be built only as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		GenericXLogFinish(state);
+	}
+	else
+	{
+		GenericXLogAbort(state);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f6a1a2f3f90..82816580e3c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3402,6 +3402,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282f..d54d310ba43 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..e4327b4f7dc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7e5df7bea4d..44a8a1f2875 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81653febc18..194dbbe1d0e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index df6923c9d50..0966397d344 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index db874902820..51350df0bf0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index f503c652ebc..7067452a035 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,7 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', opcname => 'stir_ops', opcfamily => 'stir/any_ops',
+  opcintype => 'any' },
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index c8ac8c73def..41ea0c3ca50 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0f22c217235..59f50e2b027 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7f71b7625df..748655fd0cf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index a41cd2b7fd9..61f3d3dea0c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-12-24 13:06  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 2 replies; 33+ messages in thread

From: Michail Nikolaev @ 2024-12-24 13:06 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

STIR is now used for validation (though, for now, without resetting the
snapshot during that phase).

Best regards,
Mikhail.



Attachments:

  [application/octet-stream] v8-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (30.1K, 3-v8-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch)
  download | inline diff:
From 31b28f4a458da9486d7d851ee6a31f0241df074e Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v8 4/7] Allow snapshot resets during parallel concurrent index
 builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
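
As a rough illustration only (it is not part of the patch), the snapshot
choice made in the nbtree hunk can be modelled as a small decision function.
The names ScanSnapshotMode, choose_snapshot_mode and
must_wait_for_snapshot_attach are hypothetical and exist only in this sketch;
the conditions mirror `reset_snapshot = isconcurrent && !isunique` and
`wait_for_snapshot_attach = reset_snapshot && leaderparticipates`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three snapshot choices in the patch. */
typedef enum
{
	SCAN_SNAPSHOT_ANY,			/* non-concurrent build: SnapshotAny */
	SCAN_SNAPSHOT_RESET,		/* concurrent, non-unique: reset periodically */
	SCAN_SNAPSHOT_MVCC			/* concurrent, unique: one MVCC snapshot */
} ScanSnapshotMode;

/* Pick the snapshot mode for a parallel index build (sketch). */
static ScanSnapshotMode
choose_snapshot_mode(bool isconcurrent, bool isunique)
{
	if (!isconcurrent)
		return SCAN_SNAPSHOT_ANY;

	/* Unique indexes keep one consistent snapshot for the whole scan. */
	return isunique ? SCAN_SNAPSHOT_MVCC : SCAN_SNAPSHOT_RESET;
}

/*
 * The leader has to wait for workers to restore the initial snapshot only
 * when it is itself going to take part in the scan and reset its own
 * active snapshot.
 */
static bool
must_wait_for_snapshot_attach(ScanSnapshotMode mode, bool leaderparticipates)
{
	assert(mode >= SCAN_SNAPSHOT_ANY && mode <= SCAN_SNAPSHOT_MVCC);
	return mode == SCAN_SNAPSHOT_RESET && leaderparticipates;
}
```

In the BRIN hunk the `isunique` input does not appear: every concurrent
parallel BRIN build takes the reset path.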
---
 src/backend/access/brin/brin.c                | 43 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 +++--
 src/backend/access/nbtree/nbtsort.c           | 38 ++++++++++++--
 src/backend/access/table/tableam.c            | 37 ++++++++++++--
 src/backend/access/transam/parallel.c         | 50 +++++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 ++--
 .../expected/cic_reset_snapshots.out          | 23 ++++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 179 insertions(+), 57 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d69859ac4df..0782bd64a6a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically. If the leader
+	 * is also going to reset its own active snapshot, we need to wait until
+	 * all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 980c51e32b9..2e5163609c1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5c4581afb1a..2acbf121745 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically. If the leader
+	 * is also going to reset its own active snapshot, we need to wait until
+	 * all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly and use the active snapshot as the initial
+	 * state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate a per-worker flag array in dynamic shared memory so that
+		 * each worker can report that it has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e0ada5ce159..f4464f64789 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1530,7 +1530,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2189bf0d9ae..b3cc7a2c150 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -287,14 +287,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index e1884acf493..a9603084aeb 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -88,6 +88,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f4c7d2a92bf..9ee5ea15fd4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1184,7 +1184,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1802,9 +1803,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is
+ * applied to the scan. That causes snapshots to be changed on the fly,
+ * which allows the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 5db54530f17..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,24 +78,35 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -97,7 +114,9 @@ REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
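
Editorial illustration (not part of either attachment): the patch above adds
one bool per worker in shared memory, which the worker sets after restoring
its snapshot and which the leader polls in WaitForParallelWorkersToAttach.
Below is a minimal single-process sketch of that bookkeeping. WorkerSlot,
init_snapshot_flags, mark_snapshot_restored and count_restored are
hypothetical names, not PostgreSQL APIs, and the sketch deliberately ignores
shared memory setup and memory ordering between processes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-worker slot kept by the leader. */
typedef struct WorkerSlot
{
	bool	   *snapshot_restored;	/* points into the shared flag array */
} WorkerSlot;

/* Leader side: point each slot at its own flag and clear it. */
static void
init_snapshot_flags(WorkerSlot *slots, bool *flags, int nworkers)
{
	for (int i = 0; i < nworkers; i++)
	{
		/* Index the bool array directly; no sizeof() scaling is needed. */
		slots[i].snapshot_restored = &flags[i];
		*slots[i].snapshot_restored = false;
	}
}

/* Worker side: report that the initial snapshot has been restored. */
static void
mark_snapshot_restored(bool *flags, int worker_number)
{
	assert(worker_number >= 0);
	flags[worker_number] = true;
}

/* Leader side: count the workers that have reported so far. */
static int
count_restored(const WorkerSlot *slots, int nworkers)
{
	int			n = 0;

	for (int i = 0; i < nworkers; i++)
	{
		if (*slots[i].snapshot_restored)
			n++;
	}
	return n;
}
```

In this sketch the leader would treat a worker as attached only once its
flag is set, which mirrors the intent of the wait_for_snapshot argument.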



  [application/octet-stream] v8-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (61.5K, 4-v8-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch)
  download | inline diff:
From 12efb82206cee7843bf17ccabacc91435d0bac5a Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v8 1/7] this is https://commitfest.postgresql.org/50/5160/
 merged in single commit. it is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 ++++++++-
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++---
 src/backend/utils/time/snapmgr.c              |   2 +
 src/test/modules/injection_points/Makefile    |   7 +-
 .../expected/index_concurrently_upsert.out    |  80 ++++++
 .../index_concurrently_upsert_predicate.out   |  80 ++++++
 .../expected/reindex_concurrently_upsert.out  | 238 ++++++++++++++++++
 ...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
 ...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |  11 +
 .../specs/index_concurrently_upsert.spec      |  68 +++++
 .../index_concurrently_upsert_predicate.spec  |  70 ++++++
 .../specs/reindex_concurrently_upsert.spec    |  86 +++++++
 ...dex_concurrently_upsert_on_constraint.spec |  86 +++++++
 ...index_concurrently_upsert_partitioned.spec |  88 +++++++
 18 files changed, 1505 insertions(+), 50 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of unequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * However, the additional arbiter indexes must be accounted for as well.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1161520f76b..23cf4c6b540 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 153390f2dc9..56b58d1ed74 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be
+					 * involved in case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider indisready indexes as well as indisvalid ones,
+		 * because they may become indisvalid before the execution phase. This
+		 * is required to keep the set of indexes used as arbiters the same
+		 * for all concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE", skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference is used, its predicate
+		 * must be implied by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a1a0c2adeb6..2189bf0d9ae 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -392,6 +393,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+			reindex_concurrently_upsert \
+			index_concurrently_upsert \
+			reindex_concurrently_upsert_partitioned \
+			reindex_concurrently_upsert_on_constraint \
+			index_concurrently_upsert_predicate
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'reindex_concurrently_upsert',
+      'index_concurrently_upsert',
+      'reindex_concurrently_upsert_partitioned',
+      'reindex_concurrently_upsert_on_constraint',
+      'index_concurrently_upsert_predicate',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
+    # We wait for all snapshots, so avoid parallel test execution
+    'runningcheck-parallel': false,
   },
   'tap': {
     'env': {
@@ -53,5 +62,7 @@ tests += {
     'tests': [
       't/001_stats.pl',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
   },
 }
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+	CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+	CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+		FOR VALUES FROM (0) TO (10000)
+		WITH (parallel_workers = 0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v8-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch (33.7K, 5-v8-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch)
  download | inline diff:
From 3c82e0404db908491bd0ebaf1d177f9741c6c6ab Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v8 5/7] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
---
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 173 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  67 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 12 files changed, 245 insertions(+), 78 deletions(-)
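
Illustration only, not part of the patch: the second check listed in the
commit message (detecting multiple alive tuples with the same key while
loading sorted tuples) can be sketched in isolation as below. This is a
minimal sketch under stated assumptions. SketchEntry and find_live_duplicate
are hypothetical names, and the precomputed "alive" flag stands in for the
heap visibility lookup that the patch performs in _bt_load. The real code
works on IndexTuple values and raises ERRCODE_UNIQUE_VIOLATION instead of
returning an index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sorted index tuple: a key plus liveness. */
typedef struct SketchEntry
{
	int			key;			/* index key value (single column here) */
	bool		alive;			/* assumed result of a heap visibility check */
} SketchEntry;

/*
 * Scan entries that are already sorted by key.  Return the position of the
 * first entry that is the second alive tuple within a group of equal keys,
 * or -1 if every group contains at most one alive tuple.
 */
long
find_live_duplicate(const SketchEntry *entries, size_t n)
{
	bool		seen_alive = false;

	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		/* A new key starts a new group, so forget earlier alive tuples. */
		if (i > 0 && entries[i].key != entries[i - 1].key)
			seen_alive = false;

		if (entries[i].alive)
		{
			if (seen_alive)
				return (long) i;	/* two alive tuples share this key */
			seen_alive = true;
		}
	}

	return -1;
}
```

For the "addda" layout mentioned in the nbtsort.c comment (alive, dead, dead,
dead, alive, all with the same key), the function reports the last entry;
groups whose duplicates are all dead are accepted.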

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e5163609c1..921b806642a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2acbf121745..ac9e5acfc53 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+    /*
+     * We need to ignore dead tuples for unique checks in case of concurrent build.
+     * This is required because of the periodic snapshot resets.
+     */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool contains no
+	 * duplicate alive tuples with equal keys. This may happen if alive tuples
+	 * are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not checked yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* were multiple alive tuples detected in this group of equal keys? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	bool		wait_for_snapshot_attach;
 	int			querylen;
 
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case when leader going to reset own active snapshot as well - we need to
 	 * wait until all workers imported initial snapshot.
 	 */
-	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
 
 	if (wait_for_snapshot_attach)
 		WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 50cbf06cb45..3d6dda4ace8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4672,7 +4670,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4790,17 +4788,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls, if provided, is set to true when any compared attribute is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4826,6 +4831,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4845,7 +4852,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4856,7 +4863,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent unique index build.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4865,6 +4873,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls, if provided, is set to true when any compared attribute is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4873,7 +4883,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4890,6 +4901,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 || isNull2);
 		att = TupleDescAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f4464f64789..4eec5525993 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1530,7 +1530,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3292,9 +3292,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6c1fce8ed25..a02729911fe 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+					 errmsg("could not create unique index \"%s\"",
+							RelationGetRelationName(arg->index.indexRel)),
+					 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+					 errdetail("Duplicate keys exist."),
+					 errtableconstraint(arg->index.heapRel,
+										RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9ee5ea15fd4..ec3769585c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1803,9 +1803,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For concurrent index builds, SO_RESET_SNAPSHOT is applied to the scan.
+ * The snapshot then changes on the fly, letting the xmin horizon advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
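
The fail-fast uniqueness check added to comparetup_index_btree_tiebreak in the patch above can be read as a small decision rule. The sketch below is only a model of that rule, not the patch's code: unique_check_fails and its boolean arguments are hypothetical names, and tuple1_live/tuple2_live stand in for the result of fetching each heap tuple with table_index_fetch_tuple.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the decision made for two index tuples with equal
 * keys.  dead_ignored corresponds to uniqueDeadIgnored; the *_live flags
 * stand in for "the heap tuple could be fetched" (an assumption made for
 * illustration).
 */
bool
unique_check_fails(bool dead_ignored, bool tuple1_live, bool tuple2_live)
{
	/* At least one duplicate is dead: skip the error for this pair. */
	if (dead_ignored && (!tuple1_live || !tuple2_live))
		return false;

	/* Otherwise the duplicate is reported as a unique violation. */
	return true;
}
```

With dead_ignored unset the rule degenerates to the previous behaviour, where every equal-keyed pair is an error.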



  [application/octet-stream] v8-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (36.3K, 6-v8-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch)
  download | inline diff:
From 452ef7089db779a08421a1084584c13c599d1320 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v8 3/7] Allow advancing xmin during non-unique, non-parallel 
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
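
As a rough illustration of the reset cadence described above, the sketch below models only the "every N pages" test. It assumes the scan visits blocks in order starting at block 0, which need not hold for a real scan; should_reset_snapshot and count_snapshot_resets are hypothetical helpers, and only the constant's name and value are taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same name and value as the constant defined in the patch. */
#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64

/* Hypothetical helper: would the scan reset its snapshot on this block? */
bool
should_reset_snapshot(bool reset_enabled, uint32_t blkno)
{
	return reset_enabled && (blkno % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0);
}

/*
 * Hypothetical helper: number of resets while visiting blocks
 * [0, nblocks) in order (an assumption made for illustration).
 */
uint32_t
count_snapshot_resets(bool reset_enabled, uint32_t nblocks)
{
	uint32_t	resets = 0;

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
	{
		if (should_reset_snapshot(reset_enabled, blkno))
			resets++;
	}
	return resets;
}
```

Under these assumptions, a 256-block scan with the option set resets the snapshot four times (at blocks 0, 64, 128 and 192), and never when the option is not set.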

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  14 +++
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  14 +++
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 107 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 15 files changed, 384 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);	/* reset snapshots? */
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3aedec882cd..d69859ac4df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..1fdfdf96482 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -566,6 +567,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases the horizon still cannot advance because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -607,7 +638,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1233,6 +1270,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..980c51e32b9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, since pushing may have copied it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);	/* no snapshot reset */
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 17a352d040c..5c4581afb1a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 05dc6add7eb..e0ada5ce159 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1490,8 +1491,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1509,19 +1510,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1532,12 +1542,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3205,7 +3222,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);	/* no snapshot reset */
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3268,12 +3286,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible to a single snapshot or to
+	 * a series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f3856c519f6..5c7514c96ac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6779,6 +6780,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6834,6 +6836,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6891,6 +6898,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index adb478a93ca..f4c7d2a92bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a fresh snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* The active snapshot must not be registered, so xmin can keep advancing. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1779,6 +1801,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index built concurrently without parallel workers,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots are then replaced
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..5db54530f17
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,107 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
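
Note on the snapshot-reset mode in the patch above: the effect of
SO_RESET_SNAPSHOT can be pictured with a small toy model. This is only a
sketch, not PostgreSQL code. The names ToyXid, ToyScan and toy_scan are
hypothetical, the "one new xid per page" workload is an assumption made for
illustration, and the patch text shown here does not state the real value of
SO_RESET_SNAPSHOT_EACH_N_PAGE, so the interval is a free parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy types; nothing here is a PostgreSQL API. */
typedef uint32_t ToyXid;

typedef struct ToyScan
{
	ToyXid		snapshot_xmin;		/* xmin pinned by the current snapshot */
	bool		reset_snapshots;	/* plays the role of SO_RESET_SNAPSHOT */
	uint32_t	pages_per_reset;	/* plays the role of the N-pages interval */
	uint32_t	resets;				/* number of times the snapshot was replaced */
} ToyScan;

/*
 * Scan npages pages.  Assumption: concurrent activity advances the newest
 * xid by one per scanned page.  Returns the xmin still pinned by the scan
 * when it finishes.
 */
static ToyXid
toy_scan(ToyScan *scan, uint32_t npages, ToyXid start_xid)
{
	ToyXid		latest = start_xid;

	scan->snapshot_xmin = latest;
	scan->resets = 0;

	for (uint32_t page = 1; page <= npages; page++)
	{
		latest++;				/* the rest of the system moves on */

		if (scan->reset_snapshots &&
			scan->pages_per_reset > 0 &&
			page % scan->pages_per_reset == 0)
		{
			/* "pop" the old snapshot and "push" a fresh one */
			scan->snapshot_xmin = latest;
			scan->resets++;
		}
	}

	return scan->snapshot_xmin;
}
```

With reset_snapshots disabled, the scan pins start_xid for its whole
duration. With it enabled, the pinned xmin trails the newest xid by fewer
than pages_per_reset pages in this model. That is the intended benefit: a
long build no longer holds the horizon back for the entire heap scan.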



  [application/octet-stream] v8-0002-Add-stress-tests-for-concurrent-index-operations.patch (6.5K, 7-v8-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From b4f22a1da4bbbff6a268c0f62196a264cb126896 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v8 2/7] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
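
Note on the pgbench scripts in 006_cic.pl: one way to read the
nextval()/setval() pair is as a throttle. The client holding the advisory
lock starts a rebuild cycle only if some writer has run since the previous
cycle. The sketch below models that reading in C. ToySequence and the toy_*
functions are hypothetical names; they only mimic the sequence behaviour the
script appears to rely on and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the in_row_rebuild sequence. */
typedef struct ToySequence
{
	long		last_value;
} ToySequence;

/* Mimics nextval(): advance by one and return the new value. */
static long
toy_nextval(ToySequence *seq)
{
	return ++seq->last_value;
}

/* Mimics setval(seq, v): the next toy_nextval() returns v + 1. */
static void
toy_setval(ToySequence *seq, long value)
{
	seq->last_value = value;
}

/*
 * Decision taken by the client that won the advisory lock: run the
 * CREATE/REINDEX/DROP cycle only while the sequence value is below 3.
 */
static bool
toy_should_rebuild(ToySequence *seq)
{
	return toy_nextval(seq) < 3;
}
```

Starting from last_value = 1 (the state after the set-up call to nextval),
the first lock holder sees 2 and rebuilds, the next one sees 3 and skips,
and any writer calling toy_setval(&seq, 1) re-arms the rebuild.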
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
 2 files changed, 145 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..142e8fb845e
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v8-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch (37.3K, 8-v8-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch)
  download | inline diff:
From 6f2d3ce069d5ccc738b3bacaa94759c13531030a Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v8 6/7] Add STIR (Short-Term Index Replacement) access method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
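
Note on the idea behind STIR as described above: an auxiliary structure that
only collects TIDs and refuses ordinary index lookups can be sketched as
follows. This is a toy illustration under stated assumptions, not the code in
stir.c. ToyTid, ToyTidStore and the toy_store_* functions are hypothetical,
and the real access method works on index pages rather than a malloc'd array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tuple identifier: heap block number plus line offset. */
typedef struct ToyTid
{
	uint32_t	block;
	uint16_t	offset;
} ToyTid;

/* Append-only collection of TIDs seen while the main index is being built. */
typedef struct ToyTidStore
{
	ToyTid	   *tids;
	size_t		count;
	size_t		capacity;
} ToyTidStore;

/* Record one TID; returns false only if memory could not be obtained. */
static bool
toy_store_insert(ToyTidStore *store, uint32_t block, uint16_t offset)
{
	if (store->count == store->capacity)
	{
		size_t		newcap = store->capacity ? store->capacity * 2 : 16;
		ToyTid	   *grown = realloc(store->tids, newcap * sizeof(ToyTid));

		if (grown == NULL)
			return false;
		store->tids = grown;
		store->capacity = newcap;
	}

	store->tids[store->count].block = block;
	store->tids[store->count].offset = offset;
	store->count++;
	return true;
}

/*
 * Key lookups are deliberately unsupported: the store is not a real index.
 * Always reports failure (-1), mirroring the "error outside the specialized
 * path" behaviour described in the commit message.
 */
static int
toy_store_lookup(const ToyTidStore *store, ToyTid key)
{
	(void) store;
	(void) key;
	return -1;
}

/* Release the collected TIDs once validation has consumed them. */
static void
toy_store_free(ToyTidStore *store)
{
	free(store->tids);
	store->tids = NULL;
	store->count = 0;
	store->capacity = 0;
}
```

The point of the design is that inserts are cheap and unordered. Anything
that needs ordering or key lookup is left to the real index, which the
validation phase fills in from the collected TIDs.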
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 576 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 780 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..bec79b48cb2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 62a371db7f7..63ee0ef134d 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	/* Initialize contents of meta page */
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+	GenericXLogFinish(state);
+
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	GenericXLogState *state;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			state = GenericXLogStart(index);
+			page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				GenericXLogFinish(state);
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			/* Didn't fit, must try other pages */
+			GenericXLogAbort(state);
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		state = GenericXLogStart(index);
+		metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index; let's try
+			 * again. */
+			GenericXLogAbort(state);
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+
+			page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+			GenericXLogFinish(state);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes are not supported to be built")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		GenericXLogFinish(state);
+	}
+	else
+	{
+		GenericXLogAbort(state);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4eec5525993..92d5f3ac009 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3402,6 +3402,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282f..d54d310ba43 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..e4327b4f7dc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7e5df7bea4d..44a8a1f2875 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81653febc18..194dbbe1d0e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index df6923c9d50..0966397d344 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index db874902820..51350df0bf0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index f503c652ebc..a8f0e66d15b 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index c8ac8c73def..41ea0c3ca50 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0f22c217235..59f50e2b027 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7f71b7625df..748655fd0cf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index a41cd2b7fd9..61f3d3dea0c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v8-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-a.patch (76.3K, 9-v8-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-a.patch)
  download | inline diff:
From b6bb0dcc3598b51203ab89940f593f6cfbf6fe7a Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 24 Dec 2024 13:40:45 +0100
Subject: [PATCH v8 7/7] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged
index during construction. This improves the efficiency of concurrent index
operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track
  new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase
  instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
 src/backend/access/heap/heapam_handler.c      | 384 +++++++++---------
 src/backend/catalog/index.c                   | 280 +++++++++++--
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 362 +++++++++++++----
 src/include/access/tableam.h                  |  28 +-
 src/include/catalog/index.h                   |  15 +-
 src/include/commands/progress.h               |   4 +-
 .../expected/cic_reset_snapshots.out          |  28 ++
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |   4 +
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/sql/create_index.sql         |   3 +
 12 files changed, 792 insertions(+), 323 deletions(-)
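
Note for readers (not part of the commit): the merge step that replaces the
second heap scan can be sketched roughly as below. This is only an
illustration under stated assumptions, not the patch's code: `Tid`, `tid_cmp`
and `tids_missing_from_main` are hypothetical stand-ins for ItemPointerData,
ItemPointerCompare() and the loop in heapam_index_validate_scan(). Skipping
duplicate auxiliary TIDs is a simplification of this sketch, and the heap
fetch plus visibility check that the patch performs before each insert is
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Order TIDs by block number first, then by offset within the block. */
static int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Both inputs must be sorted in ascending TID order.  Every TID that is
 * present in aux[] but absent from main_tids[] is copied to out[], which
 * must have room for naux entries.  Returns the number of TIDs written.
 */
static size_t
tids_missing_from_main(const Tid *main_tids, size_t nmain,
					   const Tid *aux, size_t naux, Tid *out)
{
	size_t		i = 0;			/* cursor into the main TID list */
	size_t		n = 0;			/* number of TIDs emitted so far */

	assert(out != NULL || naux == 0);

	for (size_t j = 0; j < naux; j++)
	{
		/* Sketch-only simplification: ignore repeated auxiliary TIDs. */
		if (j > 0 && tid_cmp(&aux[j - 1], &aux[j]) == 0)
			continue;

		/* Advance the main cursor until it reaches or passes aux[j]. */
		while (i < nmain && tid_cmp(&main_tids[i], &aux[j]) < 0)
			i++;

		/* Main list exhausted or overshot: aux[j] is not in the main list. */
		if (i == nmain || tid_cmp(&main_tids[i], &aux[j]) > 0)
			out[n++] = aux[j];
	}
	return n;
}
```

Since both cursors only move forward, the cost is linear in the number of
collected TIDs and does not depend on the number of heap pages.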

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 921b806642a..d575083962b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,267 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	IndexFetchTableData *fetch;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
 
 	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL,
+					prev_indexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded,
+					prev_decoded,
+					fetched;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
+	/*
+	 * Now take the "reference snapshot" that will be used to filter candidate
+	 * tuples.  Beware!  There might still be snapshots in
+	 * use that treat some transaction as in-progress that our reference
+	 * snapshot treats as committed.  If such a recently-committed transaction
+	 * deleted tuples in the table, we will not include them in the index; yet
+	 * those transactions which see the deleting one as still-in-progress will
+	 * expect such tuples to be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	fetch = heapam_index_fetch_begin(heapRelation);
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&prev_decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+	ItemPointerSetInvalid(&fetched);
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	/* We'll track the last "main" index position in prev_indexcursor. */
+	prev_indexcursor = &prev_decoded;
 
 	/*
-	 * Scan all tuples matching the snapshot.
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be merged with or compared to those from
+	 * the "main" sort (state->tuplesort).
 	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while (!auxtuplesort_empty)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
-
+		Datum		ts_val;
+		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
-
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
-		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
-		}
-
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(auxState->tuplesort, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
 		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
+		else
 		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
+			auxindexcursor = NULL;
 		}
 
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
 			{
+				/* Keep track of the previous TID in prev_decoded. */
+				prev_decoded = decoded;
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+
+					/*
+					 * If the current TID in the main sort is a duplicate of the
+					 * previous one (prev_indexcursor), skip it to avoid
+					 * double-inserting the same TID. Such a situation is possible
+					 * due to concurrent page splits in btree (and probably in other
+					 * indexes as well).
+					 */
+					if (ItemPointerCompare(prev_indexcursor, indexcursor) == 0)
+					{
+						elog(DEBUG5, "skipping duplicate tid in target index snapshot: (%u,%u)",
+							 ItemPointerGetBlockNumber(indexcursor),
+							 ItemPointerGetOffsetNumber(indexcursor));
+					}
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
 			}
-		}
-
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
 
 			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it into the target
+			 * index if it's visible in the heap.
 			 */
-			if (predicate != NULL)
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
 			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
+				bool call_again = false;
+				bool all_dead = false;
+				ItemPointer tid;
 
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
+				/* Copy the auxindexcursor TID into fetched. */
+				fetched = *auxindexcursor;
+				tid = &fetched;
 
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				state->htups += 1;
 
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
+				/*
+				 * Fetch the tuple from the heap to see if it's visible
+				 * under our snapshot. If it is, form the index key values
+				 * and insert a new entry into the target index.
+				 */
+				if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+				{
+
+					/* Compute the key values and null flags for this tuple. */
+					FormIndexDatum(indexInfo,
+								   slot,
+								   estate,
+								   values,
+								   isnull);
+
+					/*
+					 * Insert the tuple into the target index.
+					 */
+					index_insert(indexRelation,
+								 values,
+								 isnull,
+								 auxindexcursor, /* insert root tuple */
+								 heapRelation,
+								 indexInfo->ii_Unique ?
+								 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+								 false,
+								 indexInfo);
+
+					state->tups_inserted += 1;
+
+					elog(DEBUG5, "inserted tid: (%u,%u), root: (%u, %u)",
+											ItemPointerGetBlockNumber(auxindexcursor),
+											ItemPointerGetOffsetNumber(auxindexcursor),
+											ItemPointerGetBlockNumber(tid),
+											ItemPointerGetOffsetNumber(tid));
+				}
+				else
+				{
+					/*
+					 * The tuple wasn't visible under our snapshot. We
+					 * skip inserting it into the target index because
+					 * from our perspective, it doesn't exist.
+					 */
+					elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+						 ItemPointerGetBlockNumber(auxindexcursor),
+						 ItemPointerGetOffsetNumber(auxindexcursor));
+				}
+			}
 		}
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	heapam_index_fetch_end(fetch);
+
+	/*
+	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.  But first, save the snapshot's xmin to use as
+	 * limitXmin for GetCurrentVirtualXIDs().
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 92d5f3ac009..f0389ef8583 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -718,6 +718,9 @@ UpdateIndexRelation(Oid indexoid,
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -742,7 +745,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -753,11 +757,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -783,7 +787,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -791,6 +794,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1461,7 +1469,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1471,6 +1480,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into the catalogs and is
+ * built later on.  This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Fetch the pg_index tuple of the main index from the syscache */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false, /* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,	/* concurrent */
+							false, /* aux indexes are not summarizing */
+							oldInfo->ii_WithoutOverlaps);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -1482,7 +1639,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
  */
 void
 index_concurrently_build(Oid heapRelationId,
-						 Oid indexRelationId)
+						 Oid indexRelationId,
+						 bool auxiliary)
 {
 	Relation	heapRel;
 	Oid			save_userid;
@@ -1523,6 +1681,7 @@ index_concurrently_build(Oid heapRelationId,
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	indexInfo->ii_Auxiliary = auxiliary;
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
@@ -3275,12 +3434,20 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is fast, since it involves no
+ * actual table scan, and leaves us with an empty STIR index.  We wait again
+ * for all transactions that could have been modifying the table to terminate.
+ * After that, every new tuple is inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3291,6 +3458,7 @@ IndexCheckExclusion(Relation heapRelation,
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
+ * But these tuples are contained in the auxiliary index.
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
@@ -3300,8 +3468,10 @@ IndexCheckExclusion(Relation heapRelation,
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At that moment we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3309,12 +3479,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdeletes must be
+ * done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3330,24 +3502,25 @@ IndexCheckExclusion(Relation heapRelation,
  * necessary to be sure there are none left with a transaction snapshot
  * older than the reference (and hence possibly able to see tuples we did
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
- * transactions will be able to use it for queries.
- *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * transactions will be able to use it for queries.  The auxiliary index is
+ * dropped.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * the rest for the auxiliary index */
+	int			main_work_mem_part = (int) (((int64) maintenance_work_mem * 8) / 10);
 
 	{
 		const int	progress_index[] = {
@@ -3380,13 +3553,18 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3404,15 +3582,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only the relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   maintenance_work_mem - main_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3435,27 +3628,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
+	/* Done with tuplesort objects */
 	tuplesort_end(state.tuplesort);
+	tuplesort_end(auxState.tuplesort);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3464,8 +3663,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3524,6 +3727,13 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indislive);
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ad3082c62ac..fbbcd7d00dd 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a02729911fe..02b636a0050 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -554,6 +554,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -563,6 +564,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -584,10 +586,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -834,6 +836,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -1227,7 +1238,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1569,6 +1581,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, also create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1597,11 +1619,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1611,7 +1633,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1632,14 +1654,16 @@ DefineIndex(Oid tableId,
 	{
 		const int	progress_cols[] = {
 			PROGRESS_CREATEIDX_INDEX_OID,
+			PROGRESS_CREATEIDX_AUX_INDEX_OID,
 			PROGRESS_CREATEIDX_PHASE
 		};
 		const int64 progress_vals[] = {
 			indexRelationId,
+			auxIndexRelationId,
 			PROGRESS_CREATEIDX_PHASE_WAIT_1
 		};
 
-		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
+		pgstat_progress_update_multi_param(3, progress_cols, progress_vals);
 	}
 
 	/*
@@ -1650,7 +1674,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1686,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but indexes are still
+	 * marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index.  It will be created empty, without
+	 * any actual heap scan, but marked as "ready-for-inserts".  Its purpose is
+	 * to accumulate new tuples while the main index is actually being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1679,7 +1727,7 @@ DefineIndex(Oid tableId,
 	 */
 
 	/* Perform concurrent build of index */
-	index_concurrently_build(tableId, indexRelationId);
+	index_concurrently_build(tableId, indexRelationId, false);
 
 	/*
 	 * Commit this transaction to make the indisready update visible.
@@ -1698,43 +1746,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/*
+	 * Merge the auxiliary and target indexes: insert any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1747,6 +1780,49 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+	/* Now wait until all transactions ignore the auxiliary index, since it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
@@ -1757,12 +1833,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -3542,6 +3618,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3563,9 +3640,10 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PROGRESS_CREATEIDX_COMMAND,
 		PROGRESS_CREATEIDX_PHASE,
 		PROGRESS_CREATEIDX_INDEX_OID,
+		PROGRESS_CREATEIDX_AUX_INDEX_OID,
 		PROGRESS_CREATEIDX_ACCESS_METHOD_OID
 	};
-	int64		progress_vals[4];
+	int64		progress_vals[5];
 
 	/*
 	 * Create a memory context that will survive forced transaction commits we
@@ -3865,15 +3943,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3915,8 +3996,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = 0;	/* initializing */
 		progress_vals[2] = idx->indexId;
-		progress_vals[3] = idx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = InvalidOid;
+		progress_vals[4] = idx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
 		/* Choose a temporary relation name for the new index */
 		concurrentName = ChooseRelationName(get_rel_name(idx->indexId),
@@ -3924,6 +4006,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4024,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   idx->indexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3951,6 +4043,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3969,10 +4062,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4150,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast, as it creates an empty index without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId, true);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4086,11 +4225,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
 		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = newidx->auxIndexId;
+		progress_vals[4] = newidx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
 		/* Perform concurrent build of new index */
-		index_concurrently_build(newidx->tableId, newidx->indexId);
+		index_concurrently_build(newidx->tableId, newidx->indexId, false);
 
 		CommitTransactionCommand();
 	}
@@ -4102,24 +4242,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4134,13 +4302,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4149,19 +4310,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
 		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = newidx->auxIndexId;
+		progress_vals[4] = newidx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4335,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4271,14 +4425,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4303,6 +4457,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4316,11 +4492,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4516,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ec3769585c3..d881241f837 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1866,22 +1866,22 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c3..82d0d6b46d3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,8 +103,14 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
-									 Oid indexRelationId);
+									 Oid indexRelationId,
+									 bool auxiliary);
 
 extern void index_concurrently_swap(Oid newIndexId,
 									Oid oldIndexId,
@@ -145,7 +154,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d645230..89f8d02fdc3 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -88,6 +88,7 @@
 #define PROGRESS_CREATEIDX_TUPLES_DONE			12
 #define PROGRESS_CREATEIDX_PARTITIONS_TOTAL		13
 #define PROGRESS_CREATEIDX_PARTITIONS_DONE		14
+#define PROGRESS_CREATEIDX_AUX_INDEX_OID		15
 /* 15 and 16 reserved for "block number" metrics */
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
@@ -96,10 +97,11 @@
 #define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
 #define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
 #define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	6
 #define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 1904eb65bb9..7e008b1cbd9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3015,6 +3016,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3027,8 +3029,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c085e05f052..c44e460b0d3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1239,10 +1240,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-12-24 19:39  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  1 sibling, 0 replies; 33+ messages in thread

From: Michail Nikolaev @ 2024-12-24 19:39 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

Rebased + snapshot resetting during validation + removed PROC_IN_SAFE_IC.
Going to do some benchmarks soon.

Best regards,
Mikhail.



Attachments:

  [application/octet-stream] v9-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch (35.1K, 3-v9-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch)
  download | inline diff:
From 86d498d18c232a62c4da4e5849258c1ab09f69b3 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v9 5/9] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
---
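Reader's note, not part of the patch: the second point of the commit message (detecting multiple alive tuples with the same key in sorted order) can be sketched as below. This is a simplified model under stated assumptions. `SpoolEntry` and `find_alive_duplicate` are hypothetical names, and the `alive` flag stands in for the heap visibility lookup that the patch performs in `_bt_load`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for one sorted spool entry: an index key plus a flag
 * telling whether the heap tuple it points to is currently alive.
 */
typedef struct SpoolEntry
{
	int			key;
	bool		alive;
} SpoolEntry;

/*
 * Scan entries sorted by key and return the position of the second alive
 * entry found within one group of equal keys, or -1 if every group has at
 * most one alive entry.  Dead entries never cause a violation by themselves.
 */
static long
find_alive_duplicate(const SpoolEntry *entries, size_t n)
{
	bool		seen_alive = false;

	for (size_t i = 0; i < n; i++)
	{
		/* The input must be sorted by key. */
		assert(i == 0 || entries[i - 1].key <= entries[i].key);

		/* A new key starts a new group: forget alive entries seen so far. */
		if (i == 0 || entries[i].key != entries[i - 1].key)
			seen_alive = false;

		if (entries[i].alive)
		{
			if (seen_alive)
				return (long) i;	/* second alive tuple with the same key */
			seen_alive = true;
		}
	}
	return -1;
}
```

With the "addda" pattern mentioned in the patch comments (alive, dead, dead, dead, alive, all with the same key), the function reports the last entry, even though no two adjacent entries are both alive.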
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 173 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  67 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 251 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8144743c338..0f706553605 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 783489600fc..38355601421 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+    /*
+     * Dead tuples must be ignored in uniqueness checks during a concurrent build.
+     * This is required because of the periodic snapshot resets.
+     */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool contains no group of
+	 * multiple alive tuples with the same key. This may happen if alive tuples
+	 * are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet checked */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	bool		wait_for_snapshot_attach;
 	int			querylen;
 
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case when leader going to reset own active snapshot as well - we need to
 	 * wait until all workers imported initial snapshot.
 	 */
-	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
 
 	if (wait_for_snapshot_attach)
 		WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index a531d37908a..e729b4a4d7c 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4676,7 +4674,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4794,17 +4792,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so that it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the index is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is non-NULL, *hasnulls is set to true on any null column seen.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4830,6 +4835,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4849,7 +4856,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4860,7 +4867,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4869,6 +4877,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is non-NULL, *hasnulls is set to true on any null column seen.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4877,7 +4887,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4894,6 +4905,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fcb6e940ff2..73454accf61 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3293,9 +3293,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often.  The reason for that is to
+ * allow the xmin horizon to advance.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6c1fce8ed25..a02729911fe 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible under multiple,
+	 * periodically refreshed snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 66e1ad83f1a..0ecc3147bbd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1799,9 +1799,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan, so
+ * the snapshot is replaced on the fly and the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
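The unique-index path that the patch above adds to `_bt_load` can be hard to follow inside the diff, so here is a minimal stand-alone sketch of the idea. It is a toy model under stated assumptions, not PostgreSQL code: `ToyEntry`, `toy_keys_match` and `toy_find_live_duplicate` are hypothetical names, the index has a single integer key column, and the `alive` flag stands in for the `SnapshotSelf` heap fetch that the patch performs. The patch only fetches from the heap once two adjacent keys compare equal; the sketch skips that optimization because liveness is precomputed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sorted index entry with a single key column. */
typedef struct
{
	int		key;
	bool	isnull;		/* key column is NULL */
	bool	alive;		/* stands in for a SnapshotSelf heap fetch */
} ToyEntry;

/*
 * Compare the key columns of two entries.  hasnulls is optional: when it is
 * not NULL the flag is OR-ed with the null state of the column, so the
 * caller must initialize it to false.
 */
static bool
toy_keys_match(const ToyEntry *a, const ToyEntry *b, bool *hasnulls)
{
	if (hasnulls)
		*hasnulls |= (a->isnull || b->isnull);
	if (a->isnull != b->isnull)
		return false;
	return a->isnull || a->key == b->key;
}

/*
 * Walk entries in sorted order and return the index of the first entry that
 * is the second live tuple within a group of equal keys, or -1 if there is
 * no unique violation.  Dead duplicates are skipped.
 */
static int
toy_find_live_duplicate(const ToyEntry *entries, size_t n,
						bool nulls_not_distinct)
{
	bool	seen_alive = false;

	for (size_t i = 0; i < n; i++)
	{
		bool	tuples_equal = false;

		if (i > 0)
		{
			bool	has_nulls = false;

			tuples_equal = toy_keys_match(&entries[i - 1], &entries[i],
										  &has_nulls);
			/* NULLs are distinct unless NULLS NOT DISTINCT was requested */
			if (has_nulls && !nulls_not_distinct)
				tuples_equal = false;
		}

		if (!tuples_equal)
			seen_alive = false;		/* a new group of equal keys starts here */

		if (entries[i].alive)
		{
			if (seen_alive)
				return (int) i;
			seen_alive = true;
		}
	}
	return -1;
}
```

With entries (1, live), (2, dead), (2, live), (2, dead), (3, live) the function reports no violation, because each group of equal keys holds at most one live tuple. A second live tuple within one group is what the patch turns into the "could not create unique index" error.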



  [application/octet-stream] v9-0002-Add-stress-tests-for-concurrent-index-operations.patch (6.5K, 4-v9-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 23c3c9f06ca446f1b2840c18e511a11c827cbc14 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v9 2/9] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
 2 files changed, 145 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..142e8fb845e
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE UNIQUE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0
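The comments touched by the first patch above say that resetting the scan snapshot "every so often" lets the xmin horizon advance. A rough way to see why that matters is the toy model below. It is only a sketch under loud assumptions: `toy_xmin_lag` is a hypothetical function, one transaction ID is consumed per scanned page, and the reset interval is a free parameter rather than whatever the patch series actually uses.

```c
#include <stdint.h>

/*
 * Toy model, not PostgreSQL code.  One new transaction ID is consumed per
 * page scanned.  The scan takes a snapshot at the start, and its xmin is the
 * next transaction ID at that moment.  If reset_every > 0, the snapshot is
 * replaced after every reset_every pages.
 *
 * Returns how far the newest transaction ID is ahead of the scan's xmin when
 * the scan finishes, that is, the range of transactions whose dead tuples
 * could not be cleaned up because of this scan.
 */
static uint32_t
toy_xmin_lag(uint32_t npages, uint32_t reset_every)
{
	uint32_t	next_xid = 100;		/* arbitrary starting point */
	uint32_t	snapshot_xmin = next_xid;

	for (uint32_t page = 0; page < npages; page++)
	{
		next_xid++;					/* concurrent write activity */

		if (reset_every > 0 && (page + 1) % reset_every == 0)
			snapshot_xmin = next_xid;	/* take a fresh snapshot */
	}
	return next_xid - snapshot_xmin;
}
```

In this model a scan that never resets its snapshot ends with a lag equal to the number of pages scanned, while a scan that resets every N pages ends with a lag below N. The real benefit depends on the workload and on how the reset interval is chosen, neither of which this sketch tries to capture.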



  [application/octet-stream] v9-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (30.1K, 5-v9-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch)
  download | inline diff:
From 43662a22363ddab775ec4373711be0cf39bcc1be Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v9 4/9] Allow snapshot resets during parallel concurrent index
 builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 43 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 +++--
 src/backend/access/nbtree/nbtsort.c           | 38 ++++++++++++--
 src/backend/access/table/tableam.c            | 37 ++++++++++++--
 src/backend/access/transam/parallel.c         | 50 +++++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 ++--
 .../expected/cic_reset_snapshots.out          | 23 ++++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 179 insertions(+), 57 deletions(-)
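
Reader's note (not part of the patch): the decisions this patch makes in
_bt_begin_parallel() and WaitForParallelWorkersToAttach() can be summarized
with a small standalone model. This is only a sketch under simplifying
assumptions: the enum and all three function names are hypothetical and do
not exist in PostgreSQL, and the model follows the nbtree path (the BRIN path
resets snapshots for every concurrent build, since BRIN indexes are never
unique).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three ways the parallel scan is set up. */
typedef enum ScanSnapshotMode
{
	SCAN_SNAPSHOT_ANY,			/* non-concurrent build: SnapshotAny */
	SCAN_SNAPSHOT_SERIALIZED,	/* concurrent unique build: one MVCC snapshot */
	SCAN_SNAPSHOT_RESET			/* concurrent non-unique build: periodic resets */
} ScanSnapshotMode;

/* Mirrors "reset_snapshot = isconcurrent && !btspool->isunique". */
ScanSnapshotMode
choose_scan_snapshot_mode(bool isconcurrent, bool isunique)
{
	if (!isconcurrent)
		return SCAN_SNAPSHOT_ANY;
	if (isunique)
		return SCAN_SNAPSHOT_SERIALIZED;
	return SCAN_SNAPSHOT_RESET;
}

/*
 * The leader has to wait for workers to import the initial snapshot only
 * when it is about to reset its own active snapshot, i.e. when snapshots
 * are reset and the leader takes part in the scan.
 */
bool
leader_must_wait_for_snapshots(ScanSnapshotMode mode, bool leaderparticipates)
{
	return mode == SCAN_SNAPSHOT_RESET && leaderparticipates;
}

/*
 * A worker counts as attached once it is the sender on its error queue and,
 * if the caller asked for it, has also set its snapshot_restored flag.
 */
bool
worker_counts_as_attached(bool sender_attached, bool wait_for_snapshot,
						  bool snapshot_restored)
{
	return sender_attached && (!wait_for_snapshot || snapshot_restored);
}
```

In this model a unique concurrent build keeps the pre-patch behavior, so the
leader never waits on the per-worker flags in that case.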

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d80394766d5..f076cedcc2c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.  If
+	 * the leader is going to reset its own active snapshot as well, we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d9fce07e8ad..8144743c338 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8647422ed05..783489600fc 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.  If
+	 * the leader is going to reset its own active snapshot as well, we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that
+	 * use periodic snapshot resets, mark the scan accordingly and use the active
+	 * snapshot as the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Set up dynamic shared memory to pass information about whether each
+		 * worker has imported its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c5a900f1b29..fcb6e940ff2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 101a02c5b60..153ac28db3e 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8ca8f789617..d801aca82a5 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a328f3aea6b..66e1ad83f1a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1180,7 +1180,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1798,9 +1799,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan.  This causes snapshots to be changed on the fly, allowing the
+ * xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 5db54530f17..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,24 +78,35 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -97,7 +114,9 @@ REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v9-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (36.3K, 6-v9-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch)
  download | inline diff:
From 4ee802bb929b4d401a3c69b879275fde06591866 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v9 3/9] Allow advancing xmin during non-unique, non-parallel 
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

- Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
- Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
- Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  14 +++
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  14 +++
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 107 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 15 files changed, 384 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9af445cdcdd..d80394766d5 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 329e727f80d..c2860ebbf32 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -568,6 +569,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -609,7 +640,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1236,6 +1273,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 53f572f384b..d9fce07e8ad 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot; the push may have copied it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 28522c0ac1c..8647422ed05 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6976249e9e9..c5a900f1b29 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3206,7 +3223,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3269,12 +3287,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often.  The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible to a single snapshot or to
+	 * periodically reset snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7468961b017..1ef6c7216f4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6778,6 +6779,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6833,6 +6835,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6890,6 +6897,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index bb32de11ea0..a328f3aea6b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon advancing.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* The active snapshot must not be registered, so that xmin can advance. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1775,6 +1797,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique, non-parallel concurrent build, SO_RESET_SNAPSHOT is
+ * applied to the scan.  That changes snapshots on the fly so that the xmin
+ * horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..5db54530f17
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,107 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v9-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (61.5K, 7-v9-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch)
  download | inline diff:
From d694020bb8c9b8fa6e346029bba2500c0a0f06cc Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v9 1/9] this is https://commitfest.postgresql.org/50/5160/
 merged in single commit. it is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 ++++++++-
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++---
 src/backend/utils/time/snapmgr.c              |   2 +
 src/test/modules/injection_points/Makefile    |   7 +-
 .../expected/index_concurrently_upsert.out    |  80 ++++++
 .../index_concurrently_upsert_predicate.out   |  80 ++++++
 .../expected/reindex_concurrently_upsert.out  | 238 ++++++++++++++++++
 ...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
 ...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |  11 +
 .../specs/index_concurrently_upsert.spec      |  68 +++++
 .../index_concurrently_upsert_predicate.spec  |  70 ++++++
 .../specs/reindex_concurrently_upsert.spec    |  86 +++++++
 ...dex_concurrently_upsert_on_constraint.spec |  86 +++++++
 ...index_concurrently_upsert_partitioned.spec |  88 +++++++
 18 files changed, 1505 insertions(+), 50 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for any additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c445c433df4..67befb6cba6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c31cc3ee69f..b4f9641e588 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare the requirements other indexes must meet to be used as arbiters
+					 * together with indexOidFromConstraint. Both equivalent indexes must be
+					 * involved in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider indexes that are indisready but not yet
+		 * indisvalid, because they may become indisvalid before the execution
+		 * phase. This is required to keep the set of indexes used as arbiters
+		 * the same for all concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6eb29b99735..101a02c5b60 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+			reindex_concurrently_upsert \
+			index_concurrently_upsert \
+			reindex_concurrently_upsert_partitioned \
+			reindex_concurrently_upsert_on_constraint \
+			index_concurrently_upsert_predicate
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'reindex_concurrently_upsert',
+      'index_concurrently_upsert',
+      'reindex_concurrently_upsert_partitioned',
+      'reindex_concurrently_upsert_on_constraint',
+      'index_concurrently_upsert_predicate',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
+    # These tests wait for all snapshots, so avoid parallel test execution
+    'runningcheck-parallel': false,
   },
   'tap': {
     'env': {
@@ -53,5 +62,7 @@ tests += {
     'tests': [
       't/001_stats.pl',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
   },
 }
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+	CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX INDEX CONCURRENTLY on the partition's primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+	CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+		FOR VALUES FROM (0) TO (10000)
+		WITH (parallel_workers = 0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v9-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch (37.3K, 8-v9-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch)
  download | inline diff:
From 2976d46c4c65c844c1fe5c369c6b9942ccaf14cb Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v9 6/9] Add STIR (Short-Term Index Replacement) access method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
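
Note on page capacity: the free-space arithmetic in `StirPageGetFreeSpace`
(see stir.h in this patch) can be checked in isolation. The sketch below is
illustrative only and is not part of the patch. It assumes a default build
(8 kB blocks, 8-byte maximum alignment, a 24-byte page header and a 6-byte
`ItemPointerData`); the helper names `stir_free_space` and
`stir_tuples_per_page` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a default PostgreSQL build; none are defined by the patch. */
#define BLCKSZ					8192
#define MAXIMUM_ALIGNOF			8
#define SIZE_OF_PAGE_HEADER		24	/* stands in for SizeOfPageHeaderData */
#define SIZE_OF_STIR_TUPLE		6	/* stands in for sizeof(StirTuple) */
#define SIZE_OF_STIR_OPAQUE		8	/* OffsetNumber plus three uint16 fields */

#define MAXALIGN(len) \
	(((len) + MAXIMUM_ALIGNOF - 1) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

/* Bytes left between the page header and the special (opaque) area. */
#define STIR_USABLE_BYTES \
	(BLCKSZ - MAXALIGN(SIZE_OF_PAGE_HEADER) - MAXALIGN(SIZE_OF_STIR_OPAQUE))

/* Mirrors the StirPageGetFreeSpace arithmetic for a page holding maxoff tuples. */
static size_t
stir_free_space(size_t maxoff)
{
	/* Guard against unsigned underflow for impossible tuple counts. */
	assert(maxoff <= STIR_USABLE_BYTES / SIZE_OF_STIR_TUPLE);
	return STIR_USABLE_BYTES - maxoff * SIZE_OF_STIR_TUPLE;
}

/* Count tuples until the "does the new tuple fit?" check would fail. */
static size_t
stir_tuples_per_page(void)
{
	size_t		n = 0;

	while (stir_free_space(n) >= SIZE_OF_STIR_TUPLE)
		n++;
	return n;
}
```

Under these assumptions a data page holds 1360 TIDs; other block sizes or
alignments would give a different figure.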
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 576 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 780 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..bec79b48cb2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 62a371db7f7..63ee0ef134d 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. has stirbulkdelete called during the index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * We do it anyway, for consistency with other access methods.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	/* Initialize contents of meta page */
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+	GenericXLogFinish(state);
+
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	GenericXLogState *state;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			state = GenericXLogStart(index);
+			page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				GenericXLogFinish(state);
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			/* Didn't fit, must try other pages */
+			GenericXLogAbort(state);
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		state = GenericXLogStart(index);
+		metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/*
+			 * Someone else added a new page to the index, let's try again.
+			 */
+			GenericXLogAbort(state);
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+
+			page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+			GenericXLogFinish(state);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at this point and is no
+	 * longer used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		GenericXLogFinish(state);
+	}
+	else
+	{
+		GenericXLogAbort(state);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 73454accf61..7ff7ab6c72a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3403,6 +3403,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282f..d54d310ba43 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..e4327b4f7dc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7e5df7bea4d..44a8a1f2875 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81653febc18..194dbbe1d0e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index df6923c9d50..0966397d344 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index db874902820..51350df0bf0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index f503c652ebc..a8f0e66d15b 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index c8ac8c73def..41ea0c3ca50 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2dcc2d42dac..34564109e50 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1590b643920..7d4e43148e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it a helper index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index a41cd2b7fd9..61f3d3dea0c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v9-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-a.patch (76.2K, 9-v9-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-a.patch)
  download | inline diff:
From 6e38968bc529c4c72d3473d19405f5e3b79d1ff2 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 24 Dec 2024 13:40:45 +0100
Subject: [PATCH v9 7/9] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track
  new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase
  instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
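
The merge step described above can be illustrated with a minimal standalone
sketch. It is not the patch's code: Tid, tid_compare and
tids_missing_from_main are hypothetical names used only for illustration, and
plain sorted arrays stand in for the two tuplesorts. It shows only how TIDs
that are present in the auxiliary index but absent from the main index are
found in a single pass over two sorted streams.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Order by block number, then by offset within the block. */
static int
tid_compare(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Write to out[] every TID of aux[] that is absent from main_tids[].
 * Both inputs must be sorted ascending; main_tids[] may contain duplicates.
 * out[] needs room for naux entries.  Returns the number of TIDs written.
 */
static size_t
tids_missing_from_main(const Tid *main_tids, size_t nmain,
					   const Tid *aux, size_t naux,
					   Tid *out)
{
	size_t		m = 0;
	size_t		nout = 0;

	assert(out != NULL || naux == 0);

	for (size_t a = 0; a < naux; a++)
	{
		/* Advance the main cursor until it reaches or passes aux[a]. */
		while (m < nmain && tid_compare(&main_tids[m], &aux[a]) < 0)
			m++;

		/* Main exhausted or overshot: aux[a] is not in the main index. */
		if (m == nmain || tid_compare(&main_tids[m], &aux[a]) > 0)
			out[nout++] = aux[a];
	}
	return nout;
}
```

In the patch itself, each TID found this way is additionally fetched from the
heap and inserted into the target index only if it is visible under the
reference snapshot; the sketch leaves that visibility check out.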
---
 src/backend/access/heap/heapam_handler.c      | 383 +++++++++---------
 src/backend/catalog/index.c                   | 280 +++++++++++--
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 362 +++++++++++++----
 src/include/access/tableam.h                  |  28 +-
 src/include/catalog/index.h                   |  15 +-
 src/include/commands/progress.h               |   4 +-
 .../expected/cic_reset_snapshots.out          |  28 ++
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |   4 +
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/sql/create_index.sql         |   3 +
 12 files changed, 791 insertions(+), 323 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0f706553605..ecec3c1c080 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,266 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	IndexFetchTableData *fetch;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
 
 	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL,
+					prev_indexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded,
+					prev_decoded,
+					fetched;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
+	/*
+	 * Now take the "reference snapshot" that will be used to filter candidate
+	 * tuples.  Beware!  There might still be snapshots in
+	 * use that treat some transaction as in-progress that our reference
+	 * snapshot treats as committed.  If such a recently-committed transaction
+	 * deleted tuples in the table, we will not include them in the index; yet
+	 * those transactions which see the deleting one as still-in-progress will
+	 * expect such tuples to be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	fetch = heapam_index_fetch_begin(heapRelation);
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&prev_decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+	ItemPointerSetInvalid(&fetched);
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	/* We'll track the last "main" index position in prev_indexcursor. */
+	prev_indexcursor = &prev_decoded;
 
 	/*
-	 * Scan all tuples matching the snapshot.
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be merged with or compared to those from
+	 * the "main" sort (state->tuplesort).
 	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while (!auxtuplesort_empty)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
-
+		Datum		ts_val;
+		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
-
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
-		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
-		}
-
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(auxState->tuplesort, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
 		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
+		else
 		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
+			auxindexcursor = NULL;
 		}
 
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
 			{
+				/* Keep track of the previous TID in prev_decoded. */
+				prev_decoded = decoded;
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+
+					/*
+					 * If the current TID in the main sort is a duplicate of the
+					 * previous one (prev_indexcursor), skip it to avoid
+					 * double-inserting the same TID. Such a situation is possible
+					 * due to concurrent page splits in btree (and probably other
+					 * indexes as well).
+					 */
+					if (ItemPointerCompare(prev_indexcursor, indexcursor) == 0)
+					{
+						elog(DEBUG5, "skipping duplicate tid in target index snapshot: (%u,%u)",
+							 ItemPointerGetBlockNumber(indexcursor),
+							 ItemPointerGetOffsetNumber(indexcursor));
+					}
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
 			}
-		}
-
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
 
 			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
 			 */
-			if (predicate != NULL)
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
 			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
+				bool call_again = false;
+				bool all_dead = false;
+				ItemPointer tid;
 
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
+				/* Copy the auxindexcursor TID into fetched. */
+				fetched = *auxindexcursor;
+				tid = &fetched;
 
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				state->htups += 1;
 
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
+				/*
+				 * Fetch the tuple from the heap to see if it's visible
+				 * under our snapshot. If it is, form the index key values
+				 * and insert a new entry into the target index.
+				 */
+				if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+				{
+
+					/* Compute the key values and null flags for this tuple. */
+					FormIndexDatum(indexInfo,
+								   slot,
+								   estate,
+								   values,
+								   isnull);
+
+					/*
+					 * Insert the tuple into the target index.
+					 */
+					index_insert(indexRelation,
+								 values,
+								 isnull,
+								 auxindexcursor, /* insert root tuple */
+								 heapRelation,
+								 indexInfo->ii_Unique ?
+								 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+								 false,
+								 indexInfo);
+
+					state->tups_inserted += 1;
+
+					elog(DEBUG5, "inserted tid: (%u,%u), root: (%u, %u)",
+											ItemPointerGetBlockNumber(auxindexcursor),
+											ItemPointerGetOffsetNumber(auxindexcursor),
+											ItemPointerGetBlockNumber(tid),
+											ItemPointerGetOffsetNumber(tid));
+				}
+				else
+				{
+					/*
+					 * The tuple wasn't visible under our snapshot. We
+					 * skip inserting it into the target index because
+					 * from our perspective, it doesn't exist.
+					 */
+					elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+						 ItemPointerGetBlockNumber(auxindexcursor),
+						 ItemPointerGetOffsetNumber(auxindexcursor));
+				}
+			}
 		}
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	heapam_index_fetch_end(fetch);
+
+	/*
+	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7ff7ab6c72a..8b14f66affc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -719,6 +719,9 @@ UpdateIndexRelation(Oid indexoid,
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases it
+ *		should equal the persistence level of the table; auxiliary indexes
+ *		are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +746,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +758,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +788,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +795,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1462,7 +1470,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1481,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false, /* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false, /* aux are not summarizing */
+							oldInfo->ii_WithoutOverlaps);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -1483,7 +1640,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
  */
 void
 index_concurrently_build(Oid heapRelationId,
-						 Oid indexRelationId)
+						 Oid indexRelationId,
+						 bool auxiliary)
 {
 	Relation	heapRel;
 	Oid			save_userid;
@@ -1524,6 +1682,7 @@ index_concurrently_build(Oid heapRelationId,
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	indexInfo->ii_Auxiliary = auxiliary;
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
@@ -3276,12 +3435,20 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that does not
+ * scan the table, so the result is an empty STIR index.  We wait again for all
+ * transactions that could have been modifying the table to terminate.  From
+ * that moment on, all new tuples are inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3292,6 +3459,7 @@ IndexCheckExclusion(Relation heapRelation,
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
+ * Those tuples are, however, contained in the auxiliary index.
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
@@ -3301,8 +3469,10 @@ IndexCheckExclusion(Relation heapRelation,
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At that point we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3310,12 +3480,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3331,24 +3503,25 @@ IndexCheckExclusion(Relation heapRelation,
  * necessary to be sure there are none left with a transaction snapshot
  * older than the reference (and hence possibly able to see tuples we did
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
- * transactions will be able to use it for queries.
- *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * transactions will be able to use it for queries.  The auxiliary index is
+ * dropped.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * the rest for the auxiliary index TIDs. */
+	int			main_work_mem_part = (maintenance_work_mem / 10) * 8;
 
 	{
 		const int	progress_index[] = {
@@ -3381,13 +3554,18 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3405,15 +3583,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   maintenance_work_mem - main_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3436,27 +3629,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
+	/* Done with tuplesort objects */
 	tuplesort_end(state.tuplesort);
+	tuplesort_end(auxState.tuplesort);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3465,8 +3664,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3525,6 +3728,13 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indislive);
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ad3082c62ac..fbbcd7d00dd 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a02729911fe..02b636a0050 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -554,6 +554,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -563,6 +564,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -584,10 +586,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -834,6 +836,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -1227,7 +1238,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1569,6 +1581,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index's catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1597,11 +1619,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1611,7 +1633,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1632,14 +1654,16 @@ DefineIndex(Oid tableId,
 	{
 		const int	progress_cols[] = {
 			PROGRESS_CREATEIDX_INDEX_OID,
+			PROGRESS_CREATEIDX_AUX_INDEX_OID,
 			PROGRESS_CREATEIDX_PHASE
 		};
 		const int64 progress_vals[] = {
 			indexRelationId,
+			auxIndexRelationId,
 			PROGRESS_CREATEIDX_PHASE_WAIT_1
 		};
 
-		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
+		pgstat_progress_update_multi_param(3, progress_cols, progress_vals);
 	}
 
 	/*
@@ -1650,7 +1674,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1686,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index.  It is created empty, without
+	 * any actual heap scan, but marked as "ready-for-inserts".  Its purpose is
+	 * to accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1679,7 +1727,7 @@ DefineIndex(Oid tableId,
 	 */
 
 	/* Perform concurrent build of index */
-	index_concurrently_build(tableId, indexRelationId);
+	index_concurrently_build(tableId, indexRelationId, false);
 
 	/*
 	 * Commit this transaction to make the indisready update visible.
@@ -1698,43 +1746,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/*
+	 * Merge the auxiliary index into the target, inserting any missing entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1747,6 +1780,49 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+	/* Now wait until all transactions ignore the dead auxiliary index */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
@@ -1757,12 +1833,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -3542,6 +3618,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3563,9 +3640,10 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PROGRESS_CREATEIDX_COMMAND,
 		PROGRESS_CREATEIDX_PHASE,
 		PROGRESS_CREATEIDX_INDEX_OID,
+		PROGRESS_CREATEIDX_AUX_INDEX_OID,
 		PROGRESS_CREATEIDX_ACCESS_METHOD_OID
 	};
-	int64		progress_vals[4];
+	int64		progress_vals[5];
 
 	/*
 	 * Create a memory context that will survive forced transaction commits we
@@ -3865,15 +3943,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3915,8 +3996,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = 0;	/* initializing */
 		progress_vals[2] = idx->indexId;
-		progress_vals[3] = idx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = InvalidOid;
+		progress_vals[4] = idx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
 		/* Choose a temporary relation name for the new index */
 		concurrentName = ChooseRelationName(get_rel_name(idx->indexId),
@@ -3924,6 +4006,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4024,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   idx->indexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3951,6 +4043,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3969,10 +4062,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4150,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the (empty) auxiliary index; fast, as no heap scan is done. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId, true);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4086,11 +4225,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
 		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = newidx->auxIndexId;
+		progress_vals[4] = newidx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
 		/* Perform concurrent build of new index */
-		index_concurrently_build(newidx->tableId, newidx->indexId);
+		index_concurrently_build(newidx->tableId, newidx->indexId, false);
 
 		CommitTransactionCommand();
 	}
@@ -4102,24 +4242,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4134,13 +4302,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4149,19 +4310,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
 		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = newidx->auxIndexId;
+		progress_vals[4] = newidx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4335,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4271,14 +4425,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4303,6 +4457,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4316,11 +4492,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4516,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0ecc3147bbd..fa1bdca7e2b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1862,22 +1862,22 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c3..82d0d6b46d3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,8 +103,14 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
-									 Oid indexRelationId);
+									 Oid indexRelationId,
+									 bool auxiliary);
 
 extern void index_concurrently_swap(Oid newIndexId,
 									Oid oldIndexId,
@@ -145,7 +154,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d645230..89f8d02fdc3 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -88,6 +88,7 @@
 #define PROGRESS_CREATEIDX_TUPLES_DONE			12
 #define PROGRESS_CREATEIDX_PARTITIONS_TOTAL		13
 #define PROGRESS_CREATEIDX_PARTITIONS_DONE		14
+#define PROGRESS_CREATEIDX_AUX_INDEX_OID		15
 /* 15 and 16 reserved for "block number" metrics */
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
@@ -96,10 +97,11 @@
 #define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
 #define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
 #define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	6
 #define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 1904eb65bb9..7e008b1cbd9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3015,6 +3016,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3027,8 +3029,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c085e05f052..c44e460b0d3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1239,10 +1240,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-- 
2.43.0
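
Illustration (not part of the patch): the validation step described in this series compares the TIDs held in the auxiliary index with those in the target index via a "merge join" over sorted TID lists. The sketch below shows only that merge, assuming both lists are already sorted ascending and free of duplicates. `tid_t` and `tids_missing_from_target()` are hypothetical names invented for this example; the real code would still have to fetch each candidate from the heap and check its visibility before inserting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID packed into one sortable integer. */
typedef uint64_t tid_t;

/*
 * Write into "missing" every TID that is present in aux[] but absent from
 * target[].  Both inputs must be sorted ascending without duplicates.
 * Returns the number of TIDs written; "missing" needs room for naux entries.
 */
static size_t
tids_missing_from_target(const tid_t *aux, size_t naux,
                         const tid_t *target, size_t ntarget,
                         tid_t *missing)
{
    size_t      i = 0;
    size_t      j = 0;
    size_t      nmissing = 0;

    while (i < naux)
    {
        /* Precondition check: aux[] must be strictly ascending. */
        assert(i == 0 || aux[i - 1] < aux[i]);

        /* Skip target entries that sort before the current aux TID. */
        while (j < ntarget && target[j] < aux[i])
            j++;

        /* No equal entry in target: this TID is a candidate for insertion. */
        if (j == ntarget || target[j] != aux[i])
            missing[nmissing++] = aux[i];

        i++;
    }

    return nmissing;
}
```

Because both cursors only move forward, the merge is linear in the combined length of the two lists, which is presumably why the TIDs are sorted first.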



  [application/octet-stream] v9-0008-Concurrently-built-index-validation-uses-fresh-sn.patch (10.6K, 10-v9-0008-Concurrently-built-index-validation-uses-fresh-sn.patch)
From 103989dcbe91603da753b7e9647ad12df888cfb4 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 24 Dec 2024 19:17:25 +0100
Subject: [PATCH v9 8/9] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach of using a single reference snapshot could lead to issues with xmin propagation. Specifically, if the index build took a long time, the reference snapshot's xmin could become outdated, causing the index to miss tuples that were deleted by transactions that committed after the reference snapshot was taken.

To address this, the validation process now periodically replaces the snapshot with a newer one. This ensures that the index's xmin is kept up-to-date and that all relevant tuples are included in the index.

The interval for replacing the snapshot is controlled by the `VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL` constant, which is currently set to 1000 milliseconds.
---
 src/backend/access/heap/README.HOT       | 15 +++++---
 src/backend/access/heap/heapam_handler.c | 45 ++++++++++++++++++------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/catalog/index.c              |  7 ++--
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 ++++++++
 6 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any missing ones visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ecec3c1c080..1a041c5a77b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1806,27 +1806,35 @@ heapam_index_validate_scan(Relation heapRelation,
 					fetched;
 	bool			tuplesort_empty = false,
 					auxtuplesort_empty = false;
+	instr_time		snapshotTime,
+					currentTime;
 
 	Assert(!HaveRegisteredOrActiveSnapshot());
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
+#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL	1000
 	/*
-	 * Now take the "reference snapshot" that will be used by to filter candidate
-	 * tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
 	 *
 	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
+	 * we mark the index as valid; limitXmin is tracked for that reason.
 	 *
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
+	INSTR_TIME_SET_CURRENT(snapshotTime);
 	limitXmin = snapshot->xmin;
 
 	/*
@@ -1868,6 +1876,23 @@ heapam_index_validate_scan(Relation heapRelation,
 		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
+		INSTR_TIME_SET_CURRENT(currentTime);
+		INSTR_TIME_SUBTRACT(currentTime, snapshotTime);
+		if (INSTR_TIME_GET_MILLISEC(currentTime) >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but keep the newer value just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+			INSTR_TIME_SET_CURRENT(snapshotTime);
+		}
+
 		/*
 		* Attempt to fetch the next TID from the auxiliary sort. If it's
 		* empty, we set auxindexcursor to NULL.
@@ -2020,7 +2045,7 @@ heapam_index_validate_scan(Relation heapRelation,
 	heapam_index_fetch_end(fetch);
 
 	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * Drop the latest snapshot.  We must do this before waiting out other
 	 * snapshot holders, else we will deadlock against other processes also
 	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
 	 * they must wait for.
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 38355601421..60551f82bfa 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -442,7 +442,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than the latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8b14f66affc..b4df2b1eee6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3472,8 +3472,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At that moment we clear "indisready" for
  * auxiliary index, since it is no more required/
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index.  In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3486,7 +3487,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 02b636a0050..71baeced508 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4328,7 +4328,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd5..90d358804e4 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0
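
The TransactionIdNewer() helper added in the transam.h hunk above relies on a wraparound-aware comparison, which is easy to get wrong when reading the diff alone. Below is a minimal standalone sketch of that idea. The names xid_t, xid_follows and xid_newer are hypothetical stand-ins chosen for illustration, not the PostgreSQL symbols, and the model assumes the usual 32-bit modular comparison with special XIDs below 3 compared as plain unsigned values.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;                 /* hypothetical stand-in for TransactionId */

#define XID_INVALID      ((xid_t) 0)    /* plays the role of InvalidTransactionId */
#define XID_FIRST_NORMAL ((xid_t) 3)    /* plays the role of FirstNormalTransactionId */

static bool
xid_is_valid(xid_t x)
{
    return x != XID_INVALID;
}

/* True if a is logically later than b, taking 32-bit wraparound into account. */
static bool
xid_follows(xid_t a, xid_t b)
{
    /* Special (non-normal) XIDs are compared as plain unsigned values. */
    if (a < XID_FIRST_NORMAL || b < XID_FIRST_NORMAL)
        return a > b;

    /*
     * Modular comparison: the signed difference tells which one is ahead.
     * The cast of an out-of-range value is implementation-defined in C, but
     * wraps as expected on two's-complement platforms.
     */
    return (int32_t) (a - b) > 0;
}

/* Return the newer of the two XIDs; an invalid XID always loses. */
static xid_t
xid_newer(xid_t a, xid_t b)
{
    if (!xid_is_valid(a))
        return b;
    if (!xid_is_valid(b))
        return a;
    return xid_follows(a, b) ? a : b;
}
```

In this sketch an XID just before the wraparound point counts as older than a small XID just after it, which a plain "a > b" comparison would get backwards.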



  [application/octet-stream] v9-0009-concurrent-index-build-Remove-PROC_IN_SAFE_IC-opt.patch (20.5K, 11-v9-0009-concurrent-index-build-Remove-PROC_IN_SAFE_IC-opt.patch)
  download | inline diff:
From f4c00ab0c12b2af59e801d66d689d2378730a707 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 24 Dec 2024 19:36:25 +0100
Subject: [PATCH v9 9/9] concurrent index build: Remove PROC_IN_SAFE_IC
 optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 8 files changed, 10 insertions(+), 234 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index f076cedcc2c..048c7d7995b 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2886,11 +2886,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 60551f82bfa..c6f7e527b65 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1907,11 +1907,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 71baeced508..ae058dc701b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -116,7 +116,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -416,10 +415,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -440,8 +436,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -461,8 +456,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -576,7 +570,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1153,10 +1146,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1643,10 +1632,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1703,9 +1688,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1735,10 +1717,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1780,10 +1758,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Updating pg_index might involve TOAST table access, so ensure we
 	 * have a valid snapshot.
@@ -1795,10 +1769,6 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -1811,9 +1781,6 @@ DefineIndex(Oid tableId,
 	/*
 	 * Drop auxiliary index.
 	 *
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 *
 	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 	 * right lock level.
 	 */
@@ -1823,10 +1790,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
@@ -3621,7 +3584,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3973,17 +3935,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4044,7 +3995,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4137,11 +4087,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4172,10 +4117,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId, true);
 
@@ -4184,11 +4125,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4213,10 +4149,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4237,11 +4169,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4262,10 +4189,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4298,10 +4221,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4330,9 +4249,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4354,13 +4270,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4416,12 +4325,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4483,12 +4386,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4748,36 +4645,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5a3dd5d2d40..a8ee412397a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 73893d351bb..bc0a06a1274 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc cic_reset_snapshots
+REGRESS = injection_points cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index f288633da4f..73cb5e92fdc 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -34,7 +34,6 @@ tests += {
   'regress': {
     'sql': [
       'injection_points',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0
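
The effect of dropping PROC_IN_SAFE_IC from the exclusion mask in WaitForOlderSnapshots() can be illustrated with a toy model. This is only a sketch under stated assumptions: demo_proc and count_blockers are hypothetical names, the struct is far simpler than PGPROC, and the xmin test uses a plain comparison that ignores wraparound. The two flag values are copied from the proc.h hunk above.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag values as they appear in the proc.h hunk above. */
#define PROC_IS_AUTOVACUUM 0x01
#define PROC_IN_VACUUM     0x02

typedef struct
{
    uint32_t xmin;          /* 0 means "no snapshot advertised" */
    uint8_t  status_flags;  /* bitmask of PROC_* flags */
} demo_proc;                /* hypothetical, much simpler than PGPROC */

/*
 * Count the processes a concurrent index build would have to wait for:
 * those that advertise an xmin at or below limit_xmin and carry none of
 * the flags in exclude_mask.
 */
static size_t
count_blockers(const demo_proc *procs, size_t n,
               uint32_t limit_xmin, uint8_t exclude_mask)
{
    size_t blockers = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (procs[i].status_flags & exclude_mask)
            continue;           /* explicitly ignored, e.g. lazy VACUUM */
        if (procs[i].xmin == 0)
            continue;           /* holds no snapshot, cannot block */
        if (procs[i].xmin <= limit_xmin)
            blockers++;         /* simplification: no wraparound handling */
    }
    return blockers;
}
```

In this model a process is skipped either because it carries an excluded flag or because it advertises no xmin. The commit message above suggests the patch set relies on periodically refreshed snapshots, so that an index build does not hold an old xmin, rather than on a dedicated flag.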




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-12-25 05:19  Michael Paquier <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  1 sibling, 1 reply; 33+ messages in thread

From: Michael Paquier @ 2024-12-25 05:19 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Tue, Dec 24, 2024 at 02:06:26PM +0100, Michail Nikolaev wrote:
> Now STIR used for validation (but without resetting of snapshot during
> that phase for now).

Perhaps I am the only one, but what you are doing here is confusing.

There is a dependency between one patch and the follow-up ones, but
while the first patch is clear regarding its goal of improving the
interactions between REINDEX CONCURRENTLY and INSERT ON CONFLICT
regarding the selection of arbiter index in the executor in 0001 in
the scope of the other thread you have created about this problem, it
is unclear what's the goal of what you are trying to do with 0003~, if
any of the follow-up patches help with that, and even why they have a
need to be posted on this thread.  So perhaps you should split things
and explain what your goals are for each patch, or articulate better
why things are done this way?  It looks like more things just keep
piling each time a new patch series is sent to the lists.  Posting
300kB worth of patches every 3 days is not going to help potential
reviewers, just confuse them.

Note that 0002, that attempts to introduce new tests, is costly.  This
is not acceptable for integration.  I'd suggest to replace that with
tests that have controlled and successive steps as these lead to
predictable results, rather than have something that runs an arbitrary
amount of time to stress the friction of concurrent activity (this is
still useful to prove your point, though).  That's something related
to the other thread, but in passing..
--
Michael


Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2024-12-25 15:14  Michail Nikolaev <[email protected]>
  parent: Michael Paquier <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2024-12-25 15:14 UTC (permalink / raw)
  To: Michael Paquier <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Michael!

Thank you for your comments and feedback!

Yes, this patch set contains a significant amount of code, which makes it
challenging to review. Some details are explained in the commit messages,
but I’m doing my best to structure the patch set in a way that is as
committable as possible. Once all the parts are ready, I plan to write a
detailed letter explaining everything, including benchmark results and
other relevant information.

Meanwhile, here’s a quick overview of the patch structure. If you have
suggestions for an alternative decomposition approach, I’d be happy to hear them.
The primary goals of the patch set are to:
    * Enable the xmin horizon to propagate freely during concurrent index
builds
    * Build concurrent indexes with a single heap scan

The patch set is split into the following parts. Technically, each part
could be committed separately, but all of them are required to achieve the
goals.

Part 1: Stress tests
- 0001: Yes, this patch comes from another thread and is not part of this
work as such; it’s included here as a single commit because stress testing
this patch set depends on it. Without it, issues with concurrent reindexing
and upserts cause failures.
- 0002: Yes, I agree these tests need to be refactored or moved into a
separate task. I’ll address this later.

Part 2: During the first phase of concurrently building an index, reset the
snapshot used for heap scans between pages, allowing xmin to go forward.
- 0003: Implements such snapshot resetting for non-parallel and non-unique
cases
- 0004: Extends snapshot resetting to parallel builds
- 0005: Extends snapshot resetting to unique indexes
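
To make the benefit of Part 2 concrete, here is a toy model of how far the advertised xmin can lag behind during a heap scan. It is a sketch, not code from the patches: max_xmin_lag is a hypothetical name, and the model assumes exactly one new transaction per scanned page and a snapshot that is re-taken every reset_every pages (0 meaning a single snapshot for the whole scan).

```c
#include <stdint.h>

/*
 * Return the largest gap observed between the "next XID" counter and the
 * xmin held by the scan's snapshot, over a scan of the given number of pages.
 */
static uint32_t
max_xmin_lag(uint32_t pages, uint32_t reset_every)
{
    uint32_t next_xid = 1000;        /* arbitrary starting point */
    uint32_t held_xmin = next_xid;   /* snapshot taken before the scan starts */
    uint32_t max_lag = 0;

    for (uint32_t page = 0; page < pages; page++)
    {
        /* Between pages: drop the old snapshot and take a fresh one. */
        if (reset_every != 0 && page != 0 && page % reset_every == 0)
            held_xmin = next_xid;

        next_xid++;                  /* concurrent activity while scanning */

        if (next_xid - held_xmin > max_lag)
            max_lag = next_xid - held_xmin;
    }
    return max_lag;
}
```

With a single snapshot the lag grows with the table size, while with periodic resets it is bounded by the reset interval, which is the property that lets VACUUM clean up dead tuples while the build is running.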

Part 3: Build concurrent indexes in a single heap scan
- 0006: Introduces the STIR (Short-Term Index Replacement) access method, a
specialized method for auxiliary indexes during concurrent builds
- 0007: Implements the auxiliary index approach, enabling concurrent index
builds to use a single heap scan. In a few words, it works like this: create
an empty auxiliary STIR index to track new tuples, scan the heap and build the
new index, merge the STIR tuples into the new index, then drop the auxiliary
index.
- 0008: Enhances the auxiliary index approach by resetting snapshots during
the merge phase, allowing xmin to propagate
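
The merge step of Part 3 can be sketched as a merge join over two sorted TID lists, as the index.c comment in the patch describes. The code below is an illustration under assumptions, not the patch's implementation: demo_tid and tids_missing_from_target are hypothetical names, both inputs are assumed to be sorted in ascending order, and visibility checks on the resulting tuples are left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t block;     /* heap block number */
    uint16_t offset;    /* line pointer within the block */
} demo_tid;             /* hypothetical stand-in for ItemPointerData */

/* Order TIDs by block number, then by offset. */
static int
demo_tid_cmp(demo_tid a, demo_tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Write to out[] every TID that is present in aux[] but absent from
 * target[], and return how many were written.  Both inputs must be sorted
 * ascending; out[] must have room for n_aux entries.
 */
static size_t
tids_missing_from_target(const demo_tid *aux, size_t n_aux,
                         const demo_tid *target, size_t n_target,
                         demo_tid *out)
{
    size_t j = 0;
    size_t n_out = 0;

    for (size_t i = 0; i < n_aux; i++)
    {
        /* Skip target entries that sort before the current aux TID. */
        while (j < n_target && demo_tid_cmp(target[j], aux[i]) < 0)
            j++;

        /* No match in target: this tuple still has to be inserted. */
        if (j == n_target || demo_tid_cmp(target[j], aux[i]) != 0)
            out[n_out++] = aux[i];
    }
    return n_out;
}
```

Because each list is walked only once, the cost is linear in the combined length of the two lists, and no second heap scan is needed to find the tuples that the main build missed.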

Part 4: This part only makes sense once all three previous parts are
committed (the other parts can be applied separately).
- 0009: Removes the PROC_IN_SAFE_IC logic, as it is no longer required

I plan to add a few more small optimizations and then do some larger-scale
stress testing and benchmarking. I think that without it, no one is going to
spend their time on such an amount of code :)

Merry Christmas,
Mikhail.



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-01-01 16:16  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 2 replies; 33+ messages in thread

From: Michail Nikolaev @ 2025-01-01 16:16 UTC (permalink / raw)
  To: Michael Paquier <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, everyone!

I’ve added several updates to the patch set:

* Automatic auxiliary index removal where applicable.
* Documentation updates to reflect recent changes.
* Optimization for STIR indexes: skipping datum setup, as they store only
TIDs.
* Numerous assertions to ensure that MyProc->xmin is invalid where
necessary.

I’d like to share some initial benchmark results (see attached graphs).
This involves building a B-tree index on (aid, abalance) in a pgbench setup
with scale 2000 (with WAL), while running a concurrent pgbench workload.

The patched version built the index in 68 seconds, compared to 117 seconds
on the master branch (about 42% less time, mostly because it needs only a
single heap scan).
There appears to be no effect on the throughput of the concurrent pgbench.
The maximum snapshot age remains near zero.

I am going to continue benchmarking with different options: different HOT
setups, unique indexes, different index types, and larger DB sizes (100+ GB).
If anyone has ideas for other benchmark scenarios, please share.

Best regards,
Mikhail.


Attachments:

  [image/png] image.png (48.3K, 3-image.png)

  [image/png] image.png (34.1K, 4-image.png)

  [application/octet-stream] v10-0010-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 5-v10-0010-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From 317d55f3419678413bad1611cc788cc4aa0b4140 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v10 10/11] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, because of an index
constraint violation), such indexes become junk indexes that serve no purpose.
This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6e82ae63990..f5181811198 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -473,14 +473,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 2afc550540c..ad02282fef5 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f47bbca9dbd..de636527444 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -687,6 +687,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating an auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -733,6 +735,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -775,6 +778,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be given together or not at all */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1176,6 +1181,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1458,6 +1472,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1608,6 +1623,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3829,6 +3845,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3885,6 +3902,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4173,7 +4203,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4262,13 +4293,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4294,18 +4342,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 2b4514e8a35..99cca682402 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index fbbcd7d00dd..a5003941685 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 478f43fba0b..c4d69ea4dc1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1224,7 +1224,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3593,6 +3593,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3941,6 +3942,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3948,6 +3950,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4010,12 +4013,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Look for a junk auxiliary index of the reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4025,6 +4033,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4045,10 +4054,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4205,7 +4222,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4224,6 +4242,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4406,6 +4427,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4451,6 +4474,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 49374782625..881e7ca2528 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1492,6 +1492,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1552,9 +1554,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1606,6 +1619,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must happen in the first transaction. But if
+		 * there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, perform a silent internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to
+		 * do this in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1634,12 +1675,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private memory context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 6908ca7180a..af48be04f27 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index c39eed24f1a..b2078f0dd53 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 7566425302f..0c9de428527 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3083,20 +3083,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 1df3409696e..9c362b3158f 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1268,11 +1268,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0
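
For readers following the pg_depend part of this patch: get_auxiliary_index()
scans pg_depend by (refclassid, refobjid, refobjsubid) for the main index and
returns the first dependent object that is itself an index and is linked by an
AUTO dependency. The query below is only a rough SQL sketch of that lookup, not
part of the patch. It assumes a server built with this patch set and reuses the
index name aux_index_ind6 from the regression test; any other name is equally
hypothetical.

```sql
-- Sketch: SQL counterpart of the catalog scan in get_auxiliary_index().
-- Assumption: 'aux_index_ind6' is an existing index, as in the regression test.
SELECT d.objid::regclass AS auxiliary_index
FROM pg_depend AS d
JOIN pg_class AS c ON c.oid = d.objid
WHERE d.refclassid = 'pg_class'::regclass            -- key[0]: referenced object is a relation
  AND d.refobjid = 'aux_index_ind6'::regclass        -- key[1]: the main index
  AND d.refobjsubid = 0                              -- key[2]: whole object, not a column
  AND d.classid = 'pg_class'::regclass               -- dependent object is a relation
  AND d.deptype = 'a'                                -- DEPENDENCY_AUTO
  AND c.relkind = 'i'                                -- RELKIND_INDEX
LIMIT 1;                                             -- the C loop stops at the first match
```

Like the C function, this would match any index that has an AUTO dependency on
the given index, which is the assumption stated in the function's own comment.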



  [application/octet-stream] v10-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (101.5K, 6-v10-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
From a0fb9778d7e156124cc70addd36c031b2adc3005 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v10 07/11] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
 doc/src/sgml/monitoring.sgml                  |  22 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/heapam_handler.c      | 383 +++++++++---------
 src/backend/catalog/index.c                   | 308 ++++++++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 +++++++++++++----
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  28 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 ++
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 20 files changed, 979 insertions(+), 382 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d0d176cc54f..a35d31bd02f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6202,6 +6202,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6242,10 +6254,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
        Columns <structname>blocks_total</structname> (set to the total size of the table)
        and <structname>blocks_done</structname> contain the progress information for this phase.
@@ -6265,8 +6276,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index dcf70d14bc3..c76d8edd291 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -367,11 +367,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -387,7 +386,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -397,7 +396,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -408,9 +415,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -427,7 +434,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -435,7 +442,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -446,11 +453,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -461,12 +468,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0f706553605..ecec3c1c080 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,266 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	IndexFetchTableData *fetch;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
 
 	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL,
+					prev_indexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded,
+					prev_decoded,
+					fetched;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
+	/*
+	 * Now take the "reference snapshot" that will be used to filter candidate
+	 * tuples.  Beware!  There might still be snapshots in
+	 * use that treat some transaction as in-progress that our reference
+	 * snapshot treats as committed.  If such a recently-committed transaction
+	 * deleted tuples in the table, we will not include them in the index; yet
+	 * those transactions which see the deleting one as still-in-progress will
+	 * expect such tuples to be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	fetch = heapam_index_fetch_begin(heapRelation);
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&prev_decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+	ItemPointerSetInvalid(&fetched);
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	/* We'll track the last "main" index position in prev_indexcursor. */
+	prev_indexcursor = &prev_decoded;
 
 	/*
-	 * Scan all tuples matching the snapshot.
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be merged with or compared to those from
+	 * the "main" sort (state->tuplesort).
 	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while (!auxtuplesort_empty)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
-
+		Datum		ts_val;
+		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
-
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
-		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
-		}
-
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(auxState->tuplesort, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
 		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
+		else
 		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
+			auxindexcursor = NULL;
 		}
 
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
 			{
+				/* Keep track of the previous TID in prev_decoded. */
+				prev_decoded = decoded;
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+
+					/*
+					 * If the current TID in the main sort is a duplicate of the
+					 * previous one (prev_indexcursor), skip it to avoid
+					 * double-inserting the same TID. Such a situation is possible
+					 * due to concurrent page splits in btree (and probably other
+					 * indexes as well).
+					 */
+					if (ItemPointerCompare(prev_indexcursor, indexcursor) == 0)
+					{
+						elog(DEBUG5, "skipping duplicate tid in target index snapshot: (%u,%u)",
+							 ItemPointerGetBlockNumber(indexcursor),
+							 ItemPointerGetOffsetNumber(indexcursor));
+					}
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
 			}
-		}
-
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
 
 			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
 			 */
-			if (predicate != NULL)
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
 			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
+				bool call_again = false;
+				bool all_dead = false;
+				ItemPointer tid;
 
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
+				/* Copy the auxindexcursor TID into fetched. */
+				fetched = *auxindexcursor;
+				tid = &fetched;
 
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				state->htups += 1;
 
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
+				/*
+				 * Fetch the tuple from the heap to see if it's visible
+				 * under our snapshot. If it is, form the index key values
+				 * and insert a new entry into the target index.
+				 */
+				if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+				{
+
+					/* Compute the key values and null flags for this tuple. */
+					FormIndexDatum(indexInfo,
+								   slot,
+								   estate,
+								   values,
+								   isnull);
+
+					/*
+					 * Insert the tuple into the target index.
+					 */
+					index_insert(indexRelation,
+								 values,
+								 isnull,
+								 auxindexcursor, /* insert root tuple */
+								 heapRelation,
+								 indexInfo->ii_Unique ?
+								 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+								 false,
+								 indexInfo);
+
+					state->tups_inserted += 1;
+
+					elog(DEBUG5, "inserted tid: (%u,%u), root: (%u, %u)",
+											ItemPointerGetBlockNumber(auxindexcursor),
+											ItemPointerGetOffsetNumber(auxindexcursor),
+											ItemPointerGetBlockNumber(tid),
+											ItemPointerGetOffsetNumber(tid));
+				}
+				else
+				{
+					/*
+					 * The tuple wasn't visible under our snapshot. We
+					 * skip inserting it into the target index because
+					 * from our perspective, it doesn't exist.
+					 */
+					elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+						 ItemPointerGetBlockNumber(auxindexcursor),
+						 ItemPointerGetOffsetNumber(auxindexcursor));
+				}
+			}
 		}
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	heapam_index_fetch_end(fetch);
+
+	/*
+	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7ff7ab6c72a..e515383b288 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for index. In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +748,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +760,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +797,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1407,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1462,7 +1473,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1484,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,	/* concurrent */
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2467,7 +2628,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2527,7 +2689,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3276,12 +3439,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation with no actual
+ * table scan; the result is an empty STIR index.  We commit the transaction
+ * and again wait for all transactions that could have been modifying the
+ * table to terminate.  From that moment on, all new tuples are inserted into
+ * the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3291,18 +3463,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3310,12 +3485,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index.  Note that the bulkdelete calls must be
+ * done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3333,22 +3510,24 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to drop the auxiliary index concurrently are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * the rest for the auxiliary index */
+	int			main_work_mem_part = (maintenance_work_mem * 8) / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3381,13 +3560,18 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3405,15 +3589,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   maintenance_work_mem - main_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3436,27 +3635,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
+	/* Done with tuplesort objects */
 	tuplesort_end(state.tuplesort);
+	tuplesort_end(auxState.tuplesort);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3465,8 +3670,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3525,6 +3734,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3796,6 +4010,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4038,6 +4259,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4063,6 +4285,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f2..7045174c556 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1265,16 +1265,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ad3082c62ac..fbbcd7d00dd 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a02729911fe..3b1abe660f9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -554,6 +557,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -563,6 +567,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -584,10 +589,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -834,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -929,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1227,7 +1242,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1569,6 +1585,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index's catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1597,11 +1623,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1611,7 +1637,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1650,7 +1676,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1688,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index.  It is created empty, without any
+	 * actual heap scan, but is marked as "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions left that still see
+	 * the auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1698,43 +1748,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge the contents of the auxiliary and target indexes, inserting any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1757,12 +1795,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to wait
+	 * out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1787,6 +1825,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop the auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3542,6 +3627,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3647,8 +3733,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3700,8 +3793,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3762,6 +3862,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3865,15 +3972,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3924,6 +4034,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4052,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3951,6 +4071,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3969,10 +4090,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4178,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because it starts empty and needs no heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4102,24 +4269,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4134,13 +4329,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4152,16 +4340,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4361,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4271,14 +4451,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4303,6 +4483,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4316,11 +4518,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4542,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 44a8a1f2875..5e457c9e8c0 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -784,7 +784,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -800,6 +800,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -825,7 +826,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0ecc3147bbd..fa1bdca7e2b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1862,22 +1862,22 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c3..c39eed24f1a 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d645230..40b28f8def7 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -92,14 +92,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7d4e43148e6..3f33146bea0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -177,8 +177,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 028f8815d12..55149caad2a 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 1904eb65bb9..7566425302f 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3015,6 +3016,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3027,8 +3029,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3056,6 +3060,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fef..e0a46c0a42a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2013,14 +2013,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c085e05f052..1df3409696e 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1239,10 +1240,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1254,6 +1257,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v10-0008-Concurrently-built-index-validation-uses-fresh-s.patch (15.9K, 7-v10-0008-Concurrently-built-index-validation-uses-fresh-s.patch)
  download | inline diff:
From f220f0ab633588f50b33db787f1d84942f22f772 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:14:38 +0100
Subject: [PATCH v10 08/11] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach held a single reference snapshot for the whole validation phase. During a long validation this pinned the backend's xmin, holding back the xmin horizon and preventing VACUUM from removing dead tuples for as long as the snapshot was held.

To address this, the validation process now periodically replaces the snapshot with a newer one, so the backend's advertised xmin can advance. The newest xmin seen so far is tracked in limitXmin, so the final wait for older snapshots still covers every transaction that might expect to see tuples missing from the index.

The interval for replacing the snapshot is controlled by the `VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL` constant, which is currently set to 1000 milliseconds.
---
 doc/src/sgml/ref/create_index.sgml       | 11 ++++--
 doc/src/sgml/ref/reindex.sgml            | 11 +++---
 src/backend/access/heap/README.HOT       | 15 +++++---
 src/backend/access/heap/heapam_handler.c | 45 ++++++++++++++++++------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++++--
 src/backend/catalog/index.c              | 19 +++++++---
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 ++++++++
 9 files changed, 100 insertions(+), 32 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c76d8edd291..6e82ae63990 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -494,10 +494,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ecec3c1c080..1a041c5a77b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1806,27 +1806,35 @@ heapam_index_validate_scan(Relation heapRelation,
 					fetched;
 	bool			tuplesort_empty = false,
 					auxtuplesort_empty = false;
+	instr_time		snapshotTime,
+					currentTime;
 
 	Assert(!HaveRegisteredOrActiveSnapshot());
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
+#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL	1000
 	/*
-	 * Now take the "reference snapshot" that will be used by to filter candidate
-	 * tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
 	 *
 	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
+	 * we mark the index as valid; limitXmin is maintained for that reason.
 	 *
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
+	INSTR_TIME_SET_CURRENT(snapshotTime);
 	limitXmin = snapshot->xmin;
 
 	/*
@@ -1868,6 +1876,23 @@ heapam_index_validate_scan(Relation heapRelation,
 		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
+		INSTR_TIME_SET_CURRENT(currentTime);
+		INSTR_TIME_SUBTRACT(currentTime, snapshotTime);
+		if (INSTR_TIME_GET_MILLISEC(currentTime) >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+			INSTR_TIME_SET_CURRENT(snapshotTime);
+		}
+
 		/*
 		* Attempt to fetch the next TID from the auxiliary sort. If it's
 		* empty, we set auxindexcursor to NULL.
@@ -2020,7 +2045,7 @@ heapam_index_validate_scan(Relation heapRelation,
 	heapam_index_fetch_end(fetch);
 
 	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * Drop the latest snapshot.  We must do this before waiting out other
 	 * snapshot holders, else we will deadlock against other processes also
 	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
 	 * they must wait for.
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 67cfdc4721a..96bb38c6436 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0da069fd4d7..929cd33adcd 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -190,14 +190,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -811,7 +813,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -925,6 +926,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -965,6 +970,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e515383b288..f47bbca9dbd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3477,8 +3477,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3491,7 +3492,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3607,19 +3608,29 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
 											main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/* Execute the sort */
 	{
 		const int	progress_index[] = {
@@ -3636,8 +3647,6 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	}
 	tuplesort_performsort(state.tuplesort);
 	tuplesort_performsort(auxState.tuplesort);
-
-	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 3b1abe660f9..163564a1464 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4354,7 +4354,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd5..90d358804e4 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0
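
For readers following the `limitXmin` handling in the patch above, here is a minimal standalone sketch of a wraparound-aware "newer of two XIDs" helper. It is not the patch's code: the `xid_*` names are hypothetical stand-ins for `TransactionIdIsValid`, `TransactionIdFollows` and `TransactionIdNewer`, and the comparison assumes the usual modulo-2^32 rule for normal XIDs, with 0 as the invalid XID and 3 as the first normal one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PostgreSQL definitions (assumptions). */
typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

static inline bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

static inline bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/*
 * Does a logically follow b?  Normal XIDs are compared modulo 2^32, so a
 * numerically small XID issued after wraparound counts as newer than a
 * numerically large one issued before it.  Special XIDs compare as plain
 * unsigned integers.
 */
static inline bool
xid_follows(TransactionId a, TransactionId b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two XIDs; an invalid XID never wins. */
static inline TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}
```

In the validation loop this kind of helper would be applied as `limit = xid_newer(limit, snapshot_xmin)` after every snapshot refresh, so the limit only ever moves forward even across XID wraparound.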



  [application/octet-stream] v10-0009-Remove-PROC_IN_SAFE_IC-optimization.patch (20.6K, 8-v10-0009-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 5851b21f38102e9b04eded2e0f0d5b0b62aefb0b Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v10 09/11] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 8 files changed, 11 insertions(+), 233 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e8ff8fa0e8f..1af739dda7c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2885,11 +2885,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 96bb38c6436..6e02c0871c5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 163564a1464..478f43fba0b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -116,7 +116,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -419,10 +418,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -443,8 +439,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -464,8 +459,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -579,7 +573,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1157,10 +1150,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1647,10 +1636,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1705,9 +1690,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1737,10 +1719,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1766,9 +1744,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1785,9 +1761,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1828,10 +1801,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1852,10 +1821,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3630,7 +3595,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4002,17 +3966,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4072,7 +4025,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4165,11 +4117,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4200,10 +4147,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4212,11 +4155,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4241,10 +4179,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4264,11 +4198,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4289,10 +4218,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4325,10 +4250,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4356,9 +4277,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4380,13 +4298,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4442,12 +4353,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4509,12 +4414,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4774,36 +4673,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5a3dd5d2d40..a8ee412397a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 2225cd0bf87..b257a0344a8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc cic_reset_snapshots
+REGRESS = injection_points cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 44cc028e82f..1b5064ac496 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -34,7 +34,6 @@ tests += {
   'regress': {
     'sql': [
       'injection_points',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v10-0011-Updates-index-insert-and-value-computation-logic.patch (2.2K, 9-v10-0011-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From f334220256a4ee53b17bafa9755f8f0935888b22 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v10 11/11] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since they are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)
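
The first bullet above can be sketched outside the backend as follows. This is only a toy model under stated assumptions: `ToyIndexInfo` and `toy_form_index_datum` are hypothetical names made up for illustration (they are not PostgreSQL APIs), and plain `long` values stand in for `Datum`. It shows the idea that an auxiliary index gets all-NULL values and never looks at the row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two IndexInfo fields this sketch needs. */
typedef struct ToyIndexInfo
{
	int			num_attrs;		/* plays the role of ii_NumIndexAttrs */
	bool		auxiliary;		/* plays the role of ii_Auxiliary */
} ToyIndexInfo;

/*
 * Fill values[] and isnull[] for one row.
 *
 * For an auxiliary index every column is reported as NULL and the row is
 * not read at all, so the caller may even pass row == NULL.
 */
static void
toy_form_index_datum(const ToyIndexInfo *info, const long *row,
					 long *values, bool *isnull)
{
	int			i;

	if (info->auxiliary)
	{
		/* Short-circuit: no column values or expressions are computed. */
		for (i = 0; i < info->num_attrs; i++)
		{
			values[i] = 0;
			isnull[i] = true;
		}
		return;
	}

	/* Regular index: copy the column values from the row. */
	for (i = 0; i < info->num_attrs; i++)
	{
		values[i] = row[i];
		isnull[i] = false;
	}
}
```

Calling `toy_form_index_datum` with `auxiliary = true` and a NULL row is safe in this sketch, which is the point of the early return: the cost of computing index expressions is skipped entirely for the auxiliary index.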

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index de636527444..123cf79c9a2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2929,6 +2929,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* An auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 820749239ca..08e1e6996e7 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -434,11 +434,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v10-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.3K, 10-v10-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
  download | inline diff:
From 17089cfee742b2311de8b5c9ce39dfdf3836a895 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v10 06/11] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 576 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 780 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h
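
As I read the patch, a STIR index is an append-only list of heap TIDs with a small metapage in front. A minimal standalone sketch of that behaviour, assuming a fixed-size in-memory layout, might look like the code below. All names (`ToyStir`, `ToyTid`, `toy_stir_*`) and the page and index sizes are hypothetical and chosen only for illustration; there is no buffer locking or WAL here. Note also that `toy_stir_insert` returns whether the TID was stored, which is a choice made for this sketch and not what the access method's insert callback returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TUPLES_PER_PAGE 4	/* assumption: tiny pages keep the demo short */
#define TOY_MAX_PAGES 8			/* assumption: fixed upper bound on pages */

typedef struct ToyTid
{
	uint32_t	block;
	uint16_t	offset;
} ToyTid;

typedef struct ToyPage
{
	int			ntuples;
	ToyTid		tuples[TOY_TUPLES_PER_PAGE];
} ToyPage;

typedef struct ToyStir
{
	bool		skip_inserts;	/* models skipInserts in the metapage */
	int			last_page;		/* models lastBlkNo; 0 = no data page yet */
	ToyPage		pages[TOY_MAX_PAGES];	/* slot 0 stands for the metapage */
} ToyStir;

/* "Build": just initialize the metadata, no heap scan is involved. */
static void
toy_stir_build(ToyStir *s)
{
	int			i;

	s->skip_inserts = false;
	s->last_page = 0;
	for (i = 0; i < TOY_MAX_PAGES; i++)
		s->pages[i].ntuples = 0;
}

/* Append a TID to the last page, extending by one page when it is full. */
static bool
toy_stir_insert(ToyStir *s, ToyTid tid)
{
	ToyPage    *page;

	if (s->skip_inserts)
		return false;			/* index was marked as no longer usable */

	if (s->last_page > 0)
	{
		page = &s->pages[s->last_page];
		if (page->ntuples < TOY_TUPLES_PER_PAGE)
		{
			page->tuples[page->ntuples++] = tid;
			return true;
		}
	}

	/* Last page is full (or absent): move on to a fresh page. */
	if (s->last_page + 1 >= TOY_MAX_PAGES)
		return false;			/* toy limit reached */
	s->last_page++;
	page = &s->pages[s->last_page];
	page->tuples[0] = tid;
	page->ntuples = 1;
	return true;
}

/* Visit every stored TID once, as the validation pass would. */
static int
toy_stir_count(const ToyStir *s)
{
	int			blk;
	int			total = 0;

	for (blk = 1; blk <= s->last_page; blk++)
		total += s->pages[blk].ntuples;
	return total;
}

/* Models a plain VACUUM touching the index: stop accepting inserts. */
static void
toy_stir_mark_skip_inserts(ToyStir *s)
{
	s->skip_inserts = true;
}
```

With four TIDs per page, five inserts leave the toy index with two data pages; after `toy_stir_mark_skip_inserts` any further insert is silently ignored and the stored TIDs stay untouched.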

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..bec79b48cb2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 62a371db7f7..63ee0ef134d 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	/*
+	 * Make a new page; since it is the first page, it should be associated
+	 * with block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the
+	 * extension lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	/* Initialize contents of meta page */
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+	GenericXLogFinish(state);
+
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	GenericXLogState *state;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			state = GenericXLogStart(index);
+			page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				GenericXLogFinish(state);
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			/* Didn't fit, must try other pages */
+			GenericXLogAbort(state);
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		state = GenericXLogStart(index);
+		metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index; let's try
+			 * again. */
+			GenericXLogAbort(state);
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+
+			page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+			GenericXLogFinish(state);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can be built only as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		GenericXLogFinish(state);
+	}
+	else
+	{
+		GenericXLogAbort(state);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 73454accf61..7ff7ab6c72a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3403,6 +3403,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282f..d54d310ba43 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..e4327b4f7dc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7e5df7bea4d..44a8a1f2875 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81653febc18..194dbbe1d0e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index df6923c9d50..0966397d344 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index db874902820..51350df0bf0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index f503c652ebc..a8f0e66d15b 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index c8ac8c73def..41ea0c3ca50 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2dcc2d42dac..34564109e50 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1590b643920..7d4e43148e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index a41cd2b7fd9..61f3d3dea0c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v10-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch (43.2K, 11-v10-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch)
  download | inline diff:
From a44da4833555cee56649d43aeff3ed6c989fd0bc Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v10 03/11] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.
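
The conditions above, together with the exclusion of system catalogs, amount to a simple eligibility check. A minimal sketch, assuming a hypothetical `SketchBuildInfo` struct and helper `sketch_can_reset_snapshots` (neither exists in the patch, which derives the same facts from `IndexInfo` and the scan state); the distinction between the first heap scan and the validation scan is not modelled here:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the facts the index build code knows. */
typedef struct SketchBuildInfo
{
	bool		concurrent;		/* CREATE/REINDEX ... CONCURRENTLY? */
	bool		unique;			/* unique index? */
	bool		parallel;		/* parallel workers in use? */
	bool		system_catalog; /* is the heap a system catalog? */
} SketchBuildInfo;

/*
 * Returns true when periodic snapshot resets would be allowed under the
 * rules described in the commit message.
 */
static bool
sketch_can_reset_snapshots(const SketchBuildInfo *info)
{
	return info->concurrent &&
		!info->unique &&
		!info->parallel &&
		!info->system_catalog;
}
```

Under these assumptions, a plain concurrent build of a non-unique index on a user table qualifies, while a unique, parallel, or non-concurrent build does not.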

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
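
The reset cadence can be illustrated without any PostgreSQL internals. The sketch below is a simplified model, not the patch's code: `SketchScan`, the `SKETCH_*` constants and the `sketch_*` functions are hypothetical names, and a snapshot is represented by a plain counter. It mirrors only the condition used in the heap scan code, namely that a reset happens when the flag is set and the current block number is a multiple of the reset interval (which includes block 0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SO_RESET_SNAPSHOT	(1u << 0)	/* hypothetical flag bit */
#define SKETCH_RESET_EACH_N_PAGE	64			/* same interval as the patch */

/* Hypothetical, heavily reduced scan descriptor. */
typedef struct SketchScan
{
	uint32_t	flags;			/* scan option bits */
	uint32_t	snapshot_id;	/* stand-in for the active snapshot */
	uint32_t	resets;			/* how many times the snapshot was replaced */
} SketchScan;

/* Should the snapshot be replaced before processing this block? */
static bool
sketch_should_reset(const SketchScan *scan, uint32_t blkno)
{
	return (scan->flags & SKETCH_SO_RESET_SNAPSHOT) != 0 &&
		blkno % SKETCH_RESET_EACH_N_PAGE == 0;
}

/* Model of fetching one block: take a fresh "snapshot" when due. */
static void
sketch_fetch_block(SketchScan *scan, uint32_t blkno)
{
	if (sketch_should_reset(scan, blkno))
	{
		scan->snapshot_id++;
		scan->resets++;
	}
}

/* Scan blocks 0..nblocks-1 and return the number of snapshot resets. */
static uint32_t
sketch_scan_blocks(SketchScan *scan, uint32_t nblocks)
{
	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
		sketch_fetch_block(scan, blkno);
	return scan->resets;
}
```

In this model a 200-block scan with the flag set replaces the snapshot at blocks 0, 64, 128 and 192, so no single snapshot is held for more than 64 pages; without the flag the snapshot is never replaced.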

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  16 +++
 src/backend/access/gin/gininsert.c            |   3 +
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 19 files changed, 406 insertions(+), 34 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);	/* reset snapshots? */
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9af445cdcdd..bc18dbd2ab3 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1224,6 +1224,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1243,6 +1244,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2366,6 +2368,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2394,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2446,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2527,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2545,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 31ee5650417..c0758e2410f 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -21,6 +21,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -375,6 +376,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	/*
 	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
 	 * prefers to receive tuples in TID order.
@@ -423,6 +425,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	return result;
 }
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 3a2759b4468..4ad1022edce 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 42c73ea5eb9..7a749fe8f64 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -191,6 +191,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 329e727f80d..c2860ebbf32 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -568,6 +569,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -609,7 +640,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1236,6 +1273,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 53f572f384b..d9fce07e8ad 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed because the snapshot is going to be replaced every
+			 * so often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot; the push may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 28522c0ac1c..12149bd962a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 305ced4dea7..4b427dc88b2 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6976249e9e9..c5a900f1b29 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3206,7 +3223,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3269,12 +3287,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One reason for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often.  The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible in one snapshot or in a
+	 * series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7468961b017..1ef6c7216f4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6778,6 +6779,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6833,6 +6835,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6890,6 +6897,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index bb32de11ea0..a328f3aea6b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often?  If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest one is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* The active snapshot must not be registered, so xmin can advance. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1775,6 +1797,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan.  This changes snapshots
+ * on the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..2225cd0bf87 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..44cc028e82f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v10-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.0K, 12-v10-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From a8918f42fba286e3355cca21ea440500ec8d4fed Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v10 05/11] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
---
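
The two uniqueness rules in the commit message above can be sketched in
isolation. What follows is a minimal, self-contained C illustration and not
the patch's code: SketchTuple and sketch_has_unique_violation are hypothetical
names, and the alive flag is an assumed stand-in for whatever visibility check
the real sort and _bt_load paths perform. The idea is that dead tuples are
skipped, and a violation is reported only when two alive tuples share a key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sorted index tuple; not a PostgreSQL type. */
typedef struct SketchTuple
{
	int			key;		/* index key value */
	bool		alive;		/* assumed result of a visibility check */
} SketchTuple;

/*
 * Walk tuples sorted by key and report whether any key value is carried by
 * more than one alive tuple.  Dead tuples are skipped, so a dead duplicate
 * never counts as a violation.
 */
static bool
sketch_has_unique_violation(const SketchTuple *tuples, size_t ntuples)
{
	bool		have_alive = false;
	int			alive_key = 0;

	assert(tuples != NULL || ntuples == 0);

	for (size_t i = 0; i < ntuples; i++)
	{
		if (!tuples[i].alive)
			continue;			/* rule 1: ignore dead tuples */
		if (have_alive && tuples[i].key == alive_key)
			return true;		/* rule 2: two alive tuples, same key */
		have_alive = true;
		alive_key = tuples[i].key;
	}
	return false;
}
```

Because the input is sorted by key, alive tuples with equal keys are adjacent
once dead ones are skipped, so remembering only the last alive key is enough.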
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 263 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8144743c338..0f706553605 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 03590c98168..67cfdc4721a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * It is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee the absence of multiple alive
+	 * tuples with the same values in the spool. This can happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet checked */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index a531d37908a..e729b4a4d7c 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4676,7 +4674,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4794,17 +4792,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as a comparison function during a concurrent
+ * unique index build when _bt_keep_natts_fast is not suitable because the
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls, if provided, is set to true if any compared attribute is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4830,6 +4835,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4849,7 +4856,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4860,7 +4867,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4869,6 +4877,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls, if provided, is set to true if any compared attribute is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4877,7 +4887,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4894,6 +4905,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fcb6e940ff2..73454accf61 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3293,9 +3293,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6c1fce8ed25..a02729911fe 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..eb47aaff566 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -30,6 +30,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -123,6 +124,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +351,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +394,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1524,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1534,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 66e1ad83f1a..0ecc3147bbd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1799,9 +1799,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of a concurrent index build, SO_RESET_SNAPSHOT is applied to the
+ * scan, so snapshots change on the fly, allowing the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
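
For readers skimming the patch above, the duplicate-key handling added to comparetup_index_btree_tiebreak can be summarized by the following minimal sketch. It is not code from the patch: SketchTuple and sketch_is_unique_violation are hypothetical names, and the `alive` field stands in for the result of the heap fetch under SnapshotSelf that the patch performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an index tuple seen by the sort comparator. */
typedef struct SketchTuple
{
	int			key;		/* indexed value */
	bool		alive;		/* assumption: result of a heap fetch for this TID */
} SketchTuple;

/*
 * Report a uniqueness violation for two tuples with equal keys, unless
 * dead tuples are ignored and at least one of the two is already dead.
 */
static bool
sketch_is_unique_violation(const SketchTuple *a, const SketchTuple *b,
						   bool unique_dead_ignored)
{
	if (a->key != b->key)
		return false;		/* different keys never conflict */
	if (unique_dead_ignored && (!a->alive || !b->alive))
		return false;		/* skip duplicates involving a dead tuple */
	return true;
}
```

With unique_dead_ignored set, two live tuples with the same key still count as a violation, while a live/dead pair does not; with the flag clear, any equal-key pair is a violation, as before the patch.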



  [application/octet-stream] v10-0004-Allow-snapshot-resets-during-parallel-concurrent.patch (34.1K, 13-v10-0004-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From ee8f8ee6b6778204a02370280f3b5437fdb04730 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v10 04/11] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 49 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 196 insertions(+), 67 deletions(-)
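
The handshake described in the commit message (the leader waits until every worker reports that it has restored its initial snapshot) can be illustrated with the minimal sketch below. It is only an illustration under stated assumptions, not the patch's implementation: it uses POSIX threads and C11 atomics instead of background workers and shared memory, and all names (SketchWorker, sketch_wait_for_workers, sketch_run_handshake, SKETCH_NWORKERS) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_NWORKERS 4		/* hypothetical worker count */

typedef struct SketchWorker
{
	pthread_t	thread;
	atomic_bool snapshot_restored;	/* set by the worker, polled by the leader */
} SketchWorker;

static void *
sketch_worker_main(void *arg)
{
	SketchWorker *worker = arg;

	/* A real worker would restore and push its snapshot before this point. */
	atomic_store(&worker->snapshot_restored, true);
	return NULL;
}

/*
 * Poll until every worker is "known".  In this sketch a worker counts as
 * attached immediately; with wait_for_snapshot it must also have set its flag.
 */
static int
sketch_wait_for_workers(SketchWorker *workers, int nworkers,
						bool wait_for_snapshot)
{
	bool		known[SKETCH_NWORKERS] = {false};
	int			nknown = 0;

	assert(nworkers <= SKETCH_NWORKERS);
	while (nknown < nworkers)
	{
		for (int i = 0; i < nworkers; i++)
		{
			if (known[i])
				continue;
			if (!wait_for_snapshot ||
				atomic_load(&workers[i].snapshot_restored))
			{
				known[i] = true;
				nknown++;
			}
		}
		if (nknown < nworkers)
			sched_yield();	/* let the workers make progress */
	}
	return nknown;
}

/* Start the workers, wait for all flags, and return how many were confirmed. */
static int
sketch_run_handshake(void)
{
	SketchWorker workers[SKETCH_NWORKERS];
	int			confirmed;

	for (int i = 0; i < SKETCH_NWORKERS; i++)
	{
		atomic_init(&workers[i].snapshot_restored, false);
		if (pthread_create(&workers[i].thread, NULL,
						   sketch_worker_main, &workers[i]) != 0)
			abort();
	}

	confirmed = sketch_wait_for_workers(workers, SKETCH_NWORKERS, true);

	/* Only now would the leader be free to start resetting its own snapshot. */
	for (int i = 0; i < SKETCH_NWORKERS; i++)
		assert(atomic_load(&workers[i].snapshot_restored));
	for (int i = 0; i < SKETCH_NWORKERS; i++)
		pthread_join(workers[i].thread, NULL);
	return confirmed;
}
```

Calling sketch_run_handshake() returns once every worker has set its flag. The point of the ordering is that the leader must not drop or replace its own snapshot until each worker holds one.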

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bc18dbd2ab3..e8ff8fa0e8f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1244,7 +1243,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1259,6 +1257,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2359,7 +2358,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2390,25 +2388,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2448,8 +2446,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2474,7 +2470,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2520,7 +2517,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2536,6 +2532,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically,
+	 * so we need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2544,7 +2547,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2567,9 +2571,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2769,14 +2770,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2798,6 +2799,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2938,6 +2940,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d9fce07e8ad..8144743c338 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 12149bd962a..03590c98168 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent non-unique build, snapshots are going to be reset
+	 * periodically.  Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan accordingly
+	 * and use the active snapshot as the initial state.  Otherwise, restore
+	 * the serialized snapshot or fall back to SnapshotAny.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report whether
+		 * each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * If wait_for_snapshot is true, additionally wait until each parallel worker
+ * has restored its snapshot. This is needed when using periodic snapshot resets
+ * to ensure all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* The snapshot is restored; set the flag to let the leader know. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c5a900f1b29..fcb6e940ff2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 101a02c5b60..153ac28db3e 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8ca8f789617..d801aca82a5 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a328f3aea6b..66e1ad83f1a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1180,7 +1180,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1798,9 +1799,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. This changes snapshots on the fly, allowing the xmin
+ * horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
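
The `wait_for_snapshot` change to `WaitForParallelWorkersToAttach` in the patch above can be read as a two-condition handshake: the leader counts a worker as attached only once the worker's error queue has a sender and, if requested, the worker has also set its snapshot-restored flag. The following is a minimal standalone sketch of that polling rule, not the patch's code. `WorkerSlot` and `poll_attached_workers` are hypothetical names, and plain struct fields stand in for the shared-memory queue and flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-worker state in the patch. */
typedef struct WorkerSlot
{
	bool		sender_attached;	/* models shm_mq_get_sender(mq) != NULL */
	bool		snapshot_restored;	/* models the worker's shared flag */
	bool		known_attached;		/* leader-side bookkeeping */
} WorkerSlot;

/*
 * One polling pass by the leader.  A worker becomes "known attached" only
 * if its queue has a sender and, when wait_for_snapshot is true, it has
 * also reported that its snapshot is restored.  Returns the number of
 * workers known to be attached after this pass.
 */
static int
poll_attached_workers(WorkerSlot *slots, int nworkers, bool wait_for_snapshot)
{
	int			known = 0;

	assert(slots != NULL || nworkers == 0);

	for (int i = 0; i < nworkers; i++)
	{
		if (!slots[i].known_attached &&
			slots[i].sender_attached &&
			(!wait_for_snapshot || slots[i].snapshot_restored))
			slots[i].known_attached = true;

		if (slots[i].known_attached)
			known++;
	}
	return known;
}
```

With `wait_for_snapshot` set to false the rule reduces to the previous behavior (queue sender present means attached), which is presumably why existing callers can pass false unchanged.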



  [application/octet-stream] v10-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (17.5K, 14-v10-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From 0559ec7bc5c90aa868a0d4acf887b7e357cb26f6 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v10 01/11] This is https://commitfest.postgresql.org/50/5160/
 merged into a single commit. It is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 6 files changed, 216 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * However, the additional arbiter indexes must be accounted for as well.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c445c433df4..67befb6cba6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c31cc3ee69f..b4f9641e588 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * When conventional inference is involved, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6eb29b99735..101a02c5b60 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0
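
The `IsIndexCompatibleAsArbiter` check added to execPartition.c in the patch above compares two indexes field by field before treating the second one as an additional arbiter. The sketch below restates only the scalar part of that comparison (uniqueness, exclusion operators, key count, key attribute numbers) on a simplified struct. `IndexSketch`, `SKETCH_MAX_KEYS`, and `sketch_compatible_as_arbiter` are hypothetical names; the expression and predicate comparisons that the patch performs with `list_difference` are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_KEYS 32		/* hypothetical bound for this sketch only */

/* Hypothetical, flattened view of the fields the patch compares. */
typedef struct IndexSketch
{
	bool		unique;				/* models ii_Unique */
	bool		has_exclusion_ops;	/* models ii_ExclusionOps != NULL */
	int			nkeyatts;			/* models rd_index->indnkeyatts */
	int			keyattnos[SKETCH_MAX_KEYS]; /* models indkey.values[] */
} IndexSketch;

/*
 * Returns true if 'candidate' matches 'arbiter' on uniqueness, has no
 * exclusion operators on either side, and has the same key columns in the
 * same order.  Index expressions and predicates are not modeled here.
 */
static bool
sketch_compatible_as_arbiter(const IndexSketch *arbiter,
							 const IndexSketch *candidate)
{
	assert(arbiter != NULL && candidate != NULL);

	if (arbiter->unique != candidate->unique)
		return false;
	if (arbiter->has_exclusion_ops || candidate->has_exclusion_ops)
		return false;
	if (arbiter->nkeyatts != candidate->nkeyatts)
		return false;

	for (int i = 0; i < candidate->nkeyatts; i++)
	{
		if (arbiter->keyattnos[i] != candidate->keyattnos[i])
			return false;
	}
	return true;
}
```

Key order matters in this comparison: an index on columns (1, 5) is not treated as compatible with one on (5, 1), which matches the position-by-position loop in the patch.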



  [application/octet-stream] v10-0002-Add-stress-tests-for-concurrent-index-operations.patch (8.0K, 15-v10-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 132002c1a85b480b6c42ec052cd5a3c480fdc0d2 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v10 02/11] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 189 ++++++++++++++++++++++++++++++++
 2 files changed, 190 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..a9559dbe3af
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,189 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for  GIN/GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 4)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIN (ia);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIST (p);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING BRIN (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING HASH (updated_at);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-01-01 17:53  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  1 sibling, 0 replies; 33+ messages in thread

From: Michail Nikolaev @ 2025-01-01 17:53 UTC (permalink / raw)
  To: Michael Paquier <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello everyone,

My apologies, I probably forgot to attach the images with the benchmark
results in my previous email.

Please find them attached to this message.

Best regards,
Mikhail


Attachments:

  [image/png] image (1).png (34.1K, 3-image%20%281%29.png)
  download | view image

  [image/png] image.png (48.3K, 4-image.png)
  download | view image


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-01-04 01:12  Matthias van de Meent <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  1 sibling, 1 reply; 33+ messages in thread

From: Matthias van de Meent @ 2025-01-04 01:12 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Wed, 1 Jan 2025 at 17:17, Michail Nikolaev
<[email protected]> wrote:
>
> Hello, everyone!
>
> I’ve added several updates to the patch set:
>
> * Automatic auxiliary index removal where applicable.
> * Documentation updates to reflect recent changes.
> * Optimization for STIR indexes: skipping datum setup, as they store only TIDs.
> * Numerous assertions to ensure that MyProc->xmin is invalid where necessary.
>
> I’d like to share some initial benchmark results (see attached graphs).
> This involves building a B-tree index on (aid, abalance) in a pgbench setup with scale 2000 (with WAL), while running a concurrent pgbench workload.
>
> The patched version built the index in 68 seconds, compared to 117 seconds with the master branch (mostly because of a single heap scan).
> There appears to be no effect on the throughput of the concurrent pgbench.
> The maximum snapshot age remains near zero.

Thank you for continuing to work on this; these are some nice results.
I'm sorry I can't spend the time I want on this every time, but I
still think it's important this can eventually get committed, so thank
you for your work.

> (mostly because of a single heap scan).

Isn't there a second heap scan, or do you consider that an index scan?

> I am going to continue to benchmark with different options: different HOT setup, unique index, different index types and DB size (100+ GB).
> If someone has some ideas about possible benchmark scenarios - please share.

I think a good benchmark could show how bloat is actually prevented,
i.e. through result table size comparisons on an update-heavy
workload, both with and without the patch.
I think it shouldn't be too difficult to show how such workloads
quickly regress to always extending the table as no cleanup can
happen, while patched they'd have much more leeway due to page
pruning. Presumably a table with a fillfactor <100 would show the best
results.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)







* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-01-06 13:36  Michail Nikolaev <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2025-01-06 13:36 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, everyone!

Some benchmark results are ready. You can access them via [0] or check the
attachments. The benchmark code is available at [1].

A few words about the environments and tests:

There are two environments:
* local: AMD Ryzen 7 7700X (8-Core), 32GB RAM, local high-performance NVMe
SSD [2].
* io2: AWS t2.2xlarge, 8 vCPUs, 32GB RAM, 300GB io2 with 64,000 IOPS (the
fastest available).

There are a few tests:
* btree_abalance - A basic new index on a frequently modified field
    query: CREATE INDEX CONCURRENTLY idx ON pgbench_accounts (abalance)
* btree_unique - A simple unique index
    query: CREATE UNIQUE INDEX CONCURRENTLY idx ON pgbench_accounts (aid)
* btree_unique_hot - A unique index with multiple tuples sharing the same
value, caused by another index
    schema: CREATE INDEX idx2 ON pgbench_accounts (abalance)
    query: CREATE UNIQUE INDEX CONCURRENTLY idx ON pgbench_accounts (aid)
* brin - A basic BRIN index
    query: CREATE INDEX CONCURRENTLY idx ON pgbench_accounts USING
brin(abalance)
* hash - A basic hash index
    query: CREATE INDEX CONCURRENTLY idx ON pgbench_accounts USING hash(bid)
* gist - A simple GiST index
    schema: CREATE EXTENSION btree_gist
    query: CREATE INDEX CONCURRENTLY idx ON pgbench_accounts using
gist(abalance)
* gin - A simple GIN index
    schema: CREATE EXTENSION btree_gin
    query: CREATE INDEX CONCURRENTLY idx ON pgbench_accounts using
gin(abalance)

The tests were executed on the pgbench schema with a scale factor of 2000
(approximately 30GB) and a fill factor of 95.

Two types of concurrent loads were tested:
* IO-bound scenario: pgbench with 8 clients.
* CPU-bound scenario: pgbench with 50 clients.


As you can see, the index build time results are quite impressive—up to 4x
faster in some cases!

However, there’s something unusual with the GiST index: occasionally it
takes longer to build. I'll investigate that.

The auxiliary index size is relatively small, typically less than 1MB.

You can also observe the typical comparison results of TPS and oldest xmin
during index builds in the provided images (except for GiST, which shows
some anomalies).

>> (mostly because of a single heap scan).
> Isn't there a second heap scan, or do you consider that an index scan?

It is something in between.

First phase: a regular heap scan is performed (with snapshot resetting).
Second phase: we collect all TIDs from target and auxiliary indexes, sort
them, and fetch from heap only records which are not present in the target
index (new tuples created during the first phase).
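
As a rough illustration of the second phase, the comparison of the two
sorted TID lists can be sketched as a merge anti-join. This is a
simplified standalone sketch, not code from the patch: tid_key and
tids_missing_from_target are hypothetical names, a TID is assumed to be
packed into one sortable 64-bit integer, and both inputs are assumed to
be sorted ascending without duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a heap TID packed into one sortable integer
 * (block number in the high bits, offset in the low bits).
 */
typedef uint64_t tid_key;

/*
 * Given two ascending-sorted arrays, copy into missing[] every TID that is
 * present in aux[] but absent from target[].  Returns the number of TIDs
 * written.  missing[] must have room for naux entries.
 */
static size_t
tids_missing_from_target(const tid_key *aux, size_t naux,
						 const tid_key *target, size_t ntarget,
						 tid_key *missing)
{
	size_t		i;
	size_t		j = 0;
	size_t		nmissing = 0;

	assert(missing != NULL);

	for (i = 0; i < naux; i++)
	{
		/* Advance the target cursor past everything smaller than aux[i]. */
		while (j < ntarget && target[j] < aux[i])
			j++;

		/* Keep the TID only if the target index has no entry for it. */
		if (j >= ntarget || target[j] != aux[i])
			missing[nmissing++] = aux[i];
	}

	return nmissing;
}
```

Only the TIDs returned in missing[] would then need a heap fetch, which
is why this phase touches far fewer heap pages than a full second scan.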

> I think a good benchmark could show how bloat is actually prevented,
> i.e. through result table size comparisons on an update-heavy
> workload, both with and without the patch.
> I think it shouldn't be too difficult to show how such workloads
> quickly regress to always extending the table as no cleanup can
> happen, while patched they'd have much more leeway due to page
> pruning. Presumably a table with a fillfactor <100 would show the best
> results.

I can’t see any significant differences from these tests so far. However, I
think this might be due to the random selection of tuples—there’s almost
always space available to place a new version on the same page.
I’ll try running the tests with a different distribution. Additionally, to
produce bloat comparable to a ~30GB table, updates will need to run for a
longer period.

Best regards,
Mikhail.

[0]:
https://docs.google.com/spreadsheets/d/1UYaqpsWSfYdZdQxaqY4gVo0RW6KrT0d-U1VDNJB8lVk/edit?usp=sharing
[1]:
https://gist.github.com/michail-nikolaev/b33fb0ac1f35729388c89f72db234b0f
[2]:
https://www.harddrivebenchmark.net/hdd.php?hdd=WD%20PC%20SN810%20SDCPNRZ%202TB&id=29324


Attachments:

  [application/pdf] PG benchmark 2 - summary.pdf (417.4K, 3-PG%20benchmark%202%20-%20summary.pdf)
  download

  [image/png] tps.png (68.8K, 4-tps.png)
  download | view image

  [image/png] graphs.png (108.8K, 5-graphs.png)
  download | view image

  [image/png] oldest_xmin_age.png (26.0K, 6-oldest_xmin_age.png)
  download | view image


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-01-08 02:12  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2025-01-08 02:12 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, everyone!

Some updates:

* Rebased.
* Resolved the issue with integer overflow in memory calculation, which
caused a performance drop during sorting.
* Fixed a broken tag in the documentation.
* Added per-tuple progress tracking in the validation phase.

Additionally, the anomaly with the GIST index has been clarified.

It occurs because the first phase is slow, and many tuples need to be
inserted during the validation phase.
For each tuple, heapam_index_fetch_tuple is called, even for those on the
same page.
It might be possible to implement a batched version of
heapam_index_fetch_tuple to handle multiple tuples on the same page and
mitigate this issue.
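
To make the batching idea concrete, here is a minimal standalone sketch
of grouping already-sorted TIDs by heap block, so that each page is
visited once per batch rather than once per tuple. It is only an
illustration under stated assumptions: simple_tid, page_batch_fn and
for_each_page_batch are hypothetical names, not part of the patch or of
the heapam API, and the real fetch logic is reduced to a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified TID: heap block number plus line pointer offset. */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} simple_tid;

/* Called once per heap page with all TIDs that live on that page. */
typedef void (*page_batch_fn) (uint32_t block, const simple_tid *tids,
							   size_t ntids, void *arg);

/*
 * Walk a TID array sorted by (block, offset) and hand each run of TIDs that
 * share a block to the callback in a single call.  Returns the number of
 * batches, i.e. the number of distinct pages visited.
 */
static size_t
for_each_page_batch(const simple_tid *tids, size_t ntids,
					page_batch_fn fn, void *arg)
{
	size_t		start = 0;
	size_t		nbatches = 0;

	assert(tids != NULL || ntids == 0);

	while (start < ntids)
	{
		size_t		end = start + 1;

		/* Extend the run while the next TID is on the same page. */
		while (end < ntids && tids[end].block == tids[start].block)
			end++;

		if (fn != NULL)
			fn(tids[start].block, &tids[start], end - start, arg);

		nbatches++;
		start = end;
	}

	return nbatches;
}
```

In this sketch the callback would pin and lock the page once and then
examine every listed offset, instead of repeating that work per tuple.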

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v11-0008-Concurrently-built-index-validation-uses-fresh-s.patch (15.9K, 3-v11-0008-Concurrently-built-index-validation-uses-fresh-s.patch)
  download | inline diff:
From 003233318e7e92cfd029e1ae2d7ab2959a251cb3 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:14:38 +0100
Subject: [PATCH v11 08/11] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach of using a single reference snapshot could lead to issues with xmin propagation. Specifically, if the index build took a long time, the reference snapshot's xmin could become outdated, causing the index to miss tuples that were deleted by transactions that committed after the reference snapshot was taken.

To address this, the validation process now periodically replaces the snapshot with a newer one. This ensures that the index's xmin is kept up-to-date and that all relevant tuples are included in the index.

The interval for replacing the snapshot is controlled by the `VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL` constant, which is currently set to 1000 milliseconds.
---
 doc/src/sgml/ref/create_index.sgml       | 11 ++++--
 doc/src/sgml/ref/reindex.sgml            | 11 +++---
 src/backend/access/heap/README.HOT       | 15 +++++---
 src/backend/access/heap/heapam_handler.c | 45 ++++++++++++++++++------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++++--
 src/backend/catalog/index.c              | 19 +++++++---
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 ++++++++
 9 files changed, 100 insertions(+), 32 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6a05620bd67..64c633e0398 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -495,10 +495,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, special auxiliary index is created the same way. It marked as
+"ready for inserts" without any actual table scan. Its purpose is collect
+new tuples inserted into table while our target index is still "not ready
+for inserts"
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then validate
+the index.  This searches for tuples missing from the index in auxiliary
+index, and inserts any missing ones if them visible to fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d2fa463298b..e974f979b55 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1804,27 +1804,35 @@ heapam_index_validate_scan(Relation heapRelation,
 					fetched;
 	bool			tuplesort_empty = false,
 					auxtuplesort_empty = false;
+	instr_time		snapshotTime,
+					currentTime;
 
 	Assert(!HaveRegisteredOrActiveSnapshot());
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
+#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL	1000
 	/*
-	 * Now take the "reference snapshot" that will be used by to filter candidate
-	 * tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
+	 * Now take the first snapshot that will be used by to filter candidate
+	 * tuples. We are going to replace it by newer snapshot every so often
+	 * to propagate horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
 	 *
 	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
+	 * we mark the index as valid, for that reason limitX is supported.
 	 *
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
+	INSTR_TIME_SET_CURRENT(snapshotTime);
 	limitXmin = snapshot->xmin;
 
 	/*
@@ -1865,6 +1873,23 @@ heapam_index_validate_scan(Relation heapRelation,
 		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
+		INSTR_TIME_SET_CURRENT(currentTime);
+		INSTR_TIME_SUBTRACT(currentTime, snapshotTime);
+		if (INSTR_TIME_GET_MILLISEC(currentTime) >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just for the case*/
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+			INSTR_TIME_SET_CURRENT(snapshotTime);
+		}
+
 		/*
 		* Attempt to fetch the next TID from the auxiliary sort. If it's
 		* empty, we set auxindexcursor to NULL.
@@ -2007,7 +2032,7 @@ heapam_index_validate_scan(Relation heapRelation,
 	heapam_index_fetch_end(fetch);
 
 	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * Drop the latest snapshot.  We must do this before waiting out other
 	 * snapshot holders, else we will deadlock against other processes also
 	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
 	 * they must wait for.
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8b236c8ccd6..62e975016ad 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..6a6b1f8797b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -190,14 +190,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM, if myXmin is invalid it is
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -811,7 +813,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -925,6 +926,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -965,6 +970,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0e06334f447..8aa6b0a2830 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3477,8 +3477,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every few so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3491,7 +3492,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3607,19 +3608,29 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
 											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/* Execute the sort */
 	{
 		const int	progress_index[] = {
@@ -3636,8 +3647,6 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	}
 	tuplesort_performsort(state.tuplesort);
 	tuplesort_performsort(auxState.tuplesort);
-
-	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cd0d63ded82..e10f6098f58 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4354,7 +4354,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 0cab8653f1b..3d8db998c0b 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0



  [application/octet-stream] v11-0010-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 4-v11-0010-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From c115b13c01b3fd1670f46bc4983ecab44aaaebb1 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v11 10/11] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, on an index constraint violation), such indexes become junk indexes that serve no purpose. This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 64c633e0398..c6db5d57167 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 096b68c7f39..1c2cfc94b54 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8aa6b0a2830..49e83155972 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -687,6 +687,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -733,6 +735,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -775,6 +778,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1176,6 +1181,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1458,6 +1472,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1608,6 +1623,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3829,6 +3845,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3885,6 +3902,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4173,7 +4203,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4262,13 +4293,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4294,18 +4342,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index from a relation
+		 * with relkind RELKIND_INDEX is the auxiliary index we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b98851a9e35..ab6dbd32d9f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1224,7 +1224,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3593,6 +3593,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3941,6 +3942,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3948,6 +3950,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4010,12 +4013,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4025,6 +4033,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4045,10 +4054,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4205,7 +4222,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4224,6 +4242,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4406,6 +4427,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4451,6 +4474,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4181c110eb7..e9b6ded6a55 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1492,6 +1492,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1552,9 +1554,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1606,6 +1619,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop is required to be the first transaction. But if
+		 * we have a junk auxiliary index, we want to drop it too (and also
+		 * in a concurrent way). In this case perform a silent internal deletion
+		 * of the auxiliary index, and restore transaction state. It is fine to do
+		 * it in the loop because there is only a single element in drop->objects.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1634,12 +1675,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 01f85e57ea2..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 34331e4d48b..d858545dba3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3096,20 +3096,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index b410fa5c541..95e6f72fd4c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1273,11 +1273,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v11-0011-Updates-index-insert-and-value-computation-logic.patch (2.2K, 5-v11-0011-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From f408d7fbdd750b88573726cf7c2de3d71170c2b4 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v11 11/11] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since they are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 49e83155972..eaf08f4f66a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2929,6 +2929,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ae11c1dd463..d070f80795d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -434,11 +434,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimisation.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v11-0009-Remove-PROC_IN_SAFE_IC-optimization.patch (20.6K, 6-v11-0009-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 93bd08b708769e33052ee34ffca9263915c1c365 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v11 09/11] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 8 files changed, 11 insertions(+), 233 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e580483a7cb..b4b36bda018 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2885,11 +2885,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * There is no status flag that can be set for a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 62e975016ad..1eb4299826e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e10f6098f58..b98851a9e35 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -116,7 +116,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -419,10 +418,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -443,8 +439,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -464,8 +459,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -579,7 +573,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1157,10 +1150,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1647,10 +1636,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1705,9 +1690,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1737,10 +1719,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1766,9 +1744,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1785,9 +1761,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1828,10 +1801,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1852,10 +1821,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3630,7 +3595,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4002,17 +3966,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4072,7 +4025,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4165,11 +4117,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4200,10 +4147,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4212,11 +4155,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4241,10 +4179,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4264,11 +4198,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4289,10 +4218,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4325,10 +4250,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4356,9 +4277,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4380,13 +4298,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4442,12 +4353,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4509,12 +4414,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4774,36 +4673,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 20777f7d5ae..4bd24bc02d4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 2225cd0bf87..b257a0344a8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc cic_reset_snapshots
+REGRESS = injection_points cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index fb131270668..051b3e789c1 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -34,7 +34,6 @@ tests += {
   'regress': {
     'sql': [
       'injection_points',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v11-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (102.0K, 7-v11-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
  download | inline diff:
From 456adb2a4111677e539799219992799fa6ce2b78 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v11 07/11] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves the efficiency of concurrent index
operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
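Reviewer note (not part of the patch): a minimal standalone sketch of the
"merge" idea described in the commit message, namely finding tuple
identifiers that the auxiliary index recorded but the main index lacks. The
Tid type, tid_cmp() and find_missing() are hypothetical names made up for
illustration. The sketch assumes both inputs are already sorted TID arrays.
The real validation code in heapam_handler.c is not reproduced here and may
differ substantially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified tuple identifier: heap block number plus line pointer offset. */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Three-way comparison by (block, offset). */
int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Walk two sorted TID arrays in one pass and copy into "missing" every TID
 * that appears in aux_tids but not in main_tids, without duplicates.
 * "missing" must have room for n_aux entries.  Returns the number written.
 */
size_t
find_missing(const Tid *main_tids, size_t n_main,
			 const Tid *aux_tids, size_t n_aux,
			 Tid *missing)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		out = 0;

	while (j < n_aux)
	{
		if (i < n_main)
		{
			int			c = tid_cmp(&main_tids[i], &aux_tids[j]);

			if (c < 0)
			{
				i++;			/* main index entry with no aux counterpart */
				continue;
			}
			if (c == 0)
			{
				j++;			/* already present in the main index */
				continue;
			}
		}

		/* Not in the main index: report it once. */
		if (out == 0 || tid_cmp(&missing[out - 1], &aux_tids[j]) != 0)
			missing[out++] = aux_tids[j];
		j++;
	}
	return out;
}
```

Because both inputs are consumed in order, the pass is linear in the number of
index entries and needs no second heap scan, which is the property the commit
message highlights.
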
 doc/src/sgml/monitoring.sgml                  |  26 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/heapam.c              |   2 +-
 src/backend/access/heap/heapam_handler.c      | 368 ++++++++---------
 src/backend/catalog/index.c                   | 308 ++++++++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 ++++++++++++++----
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  28 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 ++
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 21 files changed, 968 insertions(+), 384 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d0d176cc54f..cf7a3bf5271 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6202,6 +6202,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, so that all later transactions
+       insert new tuples into the auxiliary index.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6242,13 +6254,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6265,8 +6276,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> waits in the same way before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..6a05620bd67 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,11 +368,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fd3ceb754b0..f96845b11d0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -642,7 +642,7 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 	if (BufferIsValid(scan->rs_cbuf))
 	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
-#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 1024
 		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
 			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
 			heap_reset_scan_snapshot((TableScanDesc) scan);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bc3d3738ede..d2fa463298b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,253 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	IndexFetchTableData *fetch;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
 
 	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded,
+					fetched;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
+	/*
+	 * Now take the "reference snapshot" that will be used to filter candidate
+	 * tuples.  Beware!  There might still be snapshots in
+	 * use that treat some transaction as in-progress that our reference
+	 * snapshot treats as committed.  If such a recently-committed transaction
+	 * deleted tuples in the table, we will not include them in the index; yet
+	 * those transactions which see the deleting one as still-in-progress will
+	 * expect such tuples to be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	fetch = heapam_index_fetch_begin(heapRelation);
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+	ItemPointerSetInvalid(&fetched);
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, (int64) state->itups);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
 	/*
-	 * Scan all tuples matching the snapshot.
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be merged with or compared to those from
+	 * the "main" sort (state->tuplesort).
 	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while (!auxtuplesort_empty)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
-
+		Datum		ts_val;
+		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
-
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
-		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
-		}
-
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(auxState->tuplesort, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
 		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
+		else
 		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
+			auxindexcursor = NULL;
 		}
 
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
 			{
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
 			}
-		}
-
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
 
 			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
 			 */
-			if (predicate != NULL)
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
 			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
+				bool call_again = false;
+				bool all_dead = false;
+				ItemPointer tid;
 
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
+				/* Copy the auxindexcursor TID into fetched. */
+				fetched = *auxindexcursor;
+				tid = &fetched;
 
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				state->htups += 1;
 
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
+				/*
+				 * Fetch the tuple from the heap to see if it's visible
+				 * under our snapshot. If it is, form the index key values
+				 * and insert a new entry into the target index.
+				 */
+				if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+				{
+
+					/* Compute the key values and null flags for this tuple. */
+					FormIndexDatum(indexInfo,
+								   slot,
+								   estate,
+								   values,
+								   isnull);
+
+					/*
+					 * Insert the tuple into the target index.
+					 */
+					index_insert(indexRelation,
+								 values,
+								 isnull,
+								 auxindexcursor, /* insert root tuple */
+								 heapRelation,
+								 indexInfo->ii_Unique ?
+								 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+								 false,
+								 indexInfo);
+
+					state->tups_inserted += 1;
+
+					elog(DEBUG5, "inserted tid: (%u,%u), root: (%u, %u)",
+											ItemPointerGetBlockNumber(auxindexcursor),
+											ItemPointerGetOffsetNumber(auxindexcursor),
+											ItemPointerGetBlockNumber(tid),
+											ItemPointerGetOffsetNumber(tid));
+				}
+				else
+				{
+					/*
+					 * The tuple wasn't visible under our snapshot. We
+					 * skip inserting it into the target index because
+					 * from our perspective, it doesn't exist.
+					 */
+					elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+						 ItemPointerGetBlockNumber(auxindexcursor),
+						 ItemPointerGetOffsetNumber(auxindexcursor));
+				}
+			}
 		}
 	}
 
-	table_endscan(scan);
+	/* We may exit early when the aux tuples run out, so make sure the progress view shows completion */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, (int64) state->itups);
 
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	heapam_index_fetch_end(fetch);
+
+	/*
+	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2dbf8f82141..0e06334f447 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +748,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +760,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +797,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1407,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1462,7 +1473,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1484,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2467,7 +2628,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2527,7 +2689,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3276,12 +3439,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that, we build the auxiliary index.  This is a fast operation without
+ * any actual table scan.  As a result, we have an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3291,18 +3463,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3310,12 +3485,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index.  Note that bulkdelete must be done in a
+ * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3333,22 +3510,24 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, some actions to concurrently drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index's TIDs
+	 * and the rest for the auxiliary index's TIDs */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3381,13 +3560,18 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3405,15 +3589,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   maintenance_work_mem - (int) main_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3436,27 +3635,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
+	/* Done with tuplesort objects */
 	tuplesort_end(state.tuplesort);
+	tuplesort_end(auxState.tuplesort);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3465,8 +3670,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3525,6 +3734,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3796,6 +4010,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4038,6 +4259,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4063,6 +4285,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7a595c84db9..0e4d977db87 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1265,16 +1265,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5921dcf68a1..cd0d63ded82 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -554,6 +557,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -563,6 +567,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -584,10 +589,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -834,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -929,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1227,7 +1242,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1569,6 +1585,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1597,11 +1623,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new
+	 * indexes (main and auxiliary) will be marked not indisready and not
+	 * indisvalid, so that no one else tries to insert into or query them.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1611,7 +1637,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1650,7 +1676,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1688,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index.  It is created empty, without
+	 * any actual heap scan, but marked as "ready-for-inserts".  Its purpose
+	 * is to accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that no transactions remain that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this point we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now we can build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1698,43 +1748,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge the auxiliary and target indexes, inserting any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1757,12 +1795,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1787,6 +1825,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Wait for all transactions to see auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index, since it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3542,6 +3627,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3647,8 +3733,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3700,8 +3793,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3762,6 +3862,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3865,15 +3972,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3924,6 +4034,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4052,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3951,6 +4071,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3969,10 +4090,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4178,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast, as it creates an empty index without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4102,24 +4269,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4134,13 +4329,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4152,16 +4340,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4361,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4271,14 +4451,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4303,6 +4483,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4316,11 +4518,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4542,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 694a2518ba5..4af3d3f7455 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -784,7 +784,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -800,6 +800,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -825,7 +826,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index d69baaa364f..d2060fce7cc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1862,22 +1862,22 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..01f85e57ea2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 18e3179ef63..4c3ea686494 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -92,14 +92,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7bfe0acb91c..8ab74e2b1d9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -177,8 +177,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 8011c141bf8..34331e4d48b 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3028,6 +3029,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3040,8 +3042,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3069,6 +3073,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fef..e0a46c0a42a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2013,14 +2013,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 068c66b95a5..b410fa5c541 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1244,10 +1245,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1259,6 +1262,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v11-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.3K, 8-v11-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
  download | inline diff:
From 482e98a887a444643e27111c417149a9efa4832e Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v11 06/11] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 576 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 780 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h
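
Note on the idea described in the commit message: the following is a minimal,
standalone sketch of "temporarily storing TIDs" and later comparing them with
what the main index already holds. It is not code from this patch. Every name
in it (`ToyTid`, `ToyTidStore`, `toy_store_append`, `toy_missing_tids`) is
hypothetical, and the sort-then-merge step is an assumption suggested by the
"sorting tuples" and "merging indexes" progress phases rather than a
description of the actual STIR implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a heap tuple identifier: block number plus line offset. */
typedef struct ToyTid
{
	uint32_t	block;
	uint16_t	offset;
} ToyTid;

/* Hypothetical append-only TID collection; not the STIR on-disk format. */
typedef struct ToyTidStore
{
	ToyTid	   *items;
	size_t		count;
	size_t		capacity;
} ToyTidStore;

/* Order TIDs by block number first, then by offset. */
static int
toy_tid_cmp(const void *a, const void *b)
{
	const ToyTid *x = a;
	const ToyTid *y = b;

	if (x->block != y->block)
		return x->block < y->block ? -1 : 1;
	if (x->offset != y->offset)
		return x->offset < y->offset ? -1 : 1;
	return 0;
}

/* Append one TID, growing the array as needed; returns 0 on success. */
static int
toy_store_append(ToyTidStore *store, ToyTid tid)
{
	assert(store != NULL);
	if (store->count == store->capacity)
	{
		size_t		newcap = store->capacity ? store->capacity * 2 : 8;
		ToyTid	   *p = realloc(store->items, newcap * sizeof(ToyTid));

		if (p == NULL)
			return -1;
		store->items = p;
		store->capacity = newcap;
	}
	store->items[store->count++] = tid;
	return 0;
}

/*
 * Sort both TID sets in place, then merge them: write to "out" every distinct
 * TID collected in "aux" that is absent from "main_tids".  "out" must have
 * room for aux->count entries.  Returns the number of TIDs written.
 */
static size_t
toy_missing_tids(ToyTidStore *aux, ToyTid *main_tids, size_t nmain, ToyTid *out)
{
	size_t		i;
	size_t		j = 0;
	size_t		nout = 0;

	assert(aux != NULL);
	if (aux->count > 1)
		qsort(aux->items, aux->count, sizeof(ToyTid), toy_tid_cmp);
	if (nmain > 1)
		qsort(main_tids, nmain, sizeof(ToyTid), toy_tid_cmp);

	for (i = 0; i < aux->count; i++)
	{
		/* Skip duplicates collected by the auxiliary store. */
		if (i > 0 && toy_tid_cmp(&aux->items[i], &aux->items[i - 1]) == 0)
			continue;
		/* Advance the main cursor up to the current auxiliary TID. */
		while (j < nmain && toy_tid_cmp(&main_tids[j], &aux->items[i]) < 0)
			j++;
		if (j == nmain || toy_tid_cmp(&main_tids[j], &aux->items[i]) != 0)
			out[nout++] = aux->items[i];
	}
	return nout;
}
```

In this toy model, a caller would append a TID for each tuple inserted while
the main index is being built, then call `toy_missing_tids` once to learn which
tuples still have to be added to the main index. The caller owns `aux->items`
and releases it with `free`.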

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 09fab08b8e1..aaf55d689d2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	/* Initialize contents of meta page */
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+	GenericXLogFinish(state);
+
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	GenericXLogState *state;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		if (metaData->skipInserts)	/* inserts disabled? */
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);	/* not held during insert */
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			state = GenericXLogStart(index);
+			page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				GenericXLogFinish(state);
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			/* Didn't fit, must try other pages */
+			GenericXLogAbort(state);
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		state = GenericXLogStart(index);
+		metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+		/* Check if another backend already extended the index */
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/*
+			 * Someone else added a new page to the index, so try again.
+			 */
+			GenericXLogAbort(state);
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+
+			page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+			GenericXLogFinish(state);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can be built only as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		GenericXLogFinish(state);
+	}
+	else
+	{
+		GenericXLogAbort(state);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d937ba65c9c..2dbf8f82141 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3403,6 +3403,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2a7769b1fd1..f27d9041e2c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0d92e694d6a..a39d36c3539 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 6b66bc18286..694a2518ba5 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1be8739573f..44f8a0d5606 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 43445cdcc6c..26ddd5ec577 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b37e8a6f882..5ea2b12bf0a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b3f7aa299f5..7bfe0acb91c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v11-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch (43.2K, 9-v11-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch)
  download | inline diff:
From 67b2be8dc40832eac7fc3004803e93c8be49f8ea Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v11 03/11] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only the first scan of the heap: the second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
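
As an illustration of the paragraph above, here is a toy model of the periodic reset. It is only a sketch: ToyScan, toy_should_reset and toy_scan are hypothetical names and not PostgreSQL APIs, and the assumption that one transaction ID is consumed per scanned block exists only to make the effect visible. The only things taken from the patch are the interval of 64 pages and the "block number modulo N equals zero" trigger.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors SO_RESET_SNAPSHOT_EACH_N_PAGE from the patch. */
#define RESET_EACH_N_PAGES 64

/* Hypothetical stand-in for a heap scan; not a PostgreSQL structure. */
typedef struct ToyScan
{
	bool		reset_snapshot;	/* models the SO_RESET_SNAPSHOT flag */
	uint32_t	snapshot_xmin;	/* xmin held back by the current snapshot */
	uint32_t	resets;			/* how many times the snapshot was replaced */
} ToyScan;

/* Same trigger as in the patch: flag set and block number divisible by N. */
static bool
toy_should_reset(const ToyScan *scan, uint32_t block)
{
	return scan->reset_snapshot && (block % RESET_EACH_N_PAGES == 0);
}

/*
 * Walk blocks [0, nblocks). Assumption for illustration only: the newest
 * transaction ID grows by one per block, so a fresh snapshot taken at
 * block b has xmin == start_xid + b.
 */
static void
toy_scan(ToyScan *scan, uint32_t nblocks, uint32_t start_xid)
{
	scan->snapshot_xmin = start_xid;
	scan->resets = 0;

	for (uint32_t block = 0; block < nblocks; block++)
	{
		if (toy_should_reset(scan, block))
		{
			/* Drop the old snapshot and take a fresh one. */
			scan->snapshot_xmin = start_xid + block;
			scan->resets++;
		}
	}
}
```

In this model, a scan without the flag keeps snapshot_xmin at its starting value for the whole scan, while a scan with the flag never lags more than RESET_EACH_N_PAGES blocks behind.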

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  16 +++
 src/backend/access/gin/gininsert.c            |   3 +
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 19 files changed, 406 insertions(+), 34 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 7f7b55d902a..a026fbc692a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9a984547578..c21608a6fd8 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1224,6 +1224,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1243,6 +1244,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2366,6 +2368,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2394,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2446,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2527,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2545,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8e1788dbcf7..97ef10c0098 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -21,6 +21,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -375,6 +376,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	/*
 	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
 	 * prefers to receive tuples in TID order.
@@ -423,6 +425,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	return result;
 }
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f950b9925f5..901aa667aa0 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -191,6 +191,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 485525f4d64..fd3ceb754b0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -568,6 +569,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases the horizon still cannot advance because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -609,7 +640,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1236,6 +1273,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e817f8f8f84..580ec7f9aa8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * For a unique index we need a consistent snapshot for the whole scan.
+	 * For a parallel scan, some additional infrastructure is required to
+	 * scan with SO_RESET_SNAPSHOT, and it is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed because the snapshot is going to be replaced every
+			 * so often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* store link to snapshot because it may be copied */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 07bae342e25..0d262a4188d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db90..b490da0eeee 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 221fbb4e286..8c6dfecf515 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3206,7 +3223,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3269,12 +3287,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0ff498c4e14..c8e7880f954 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible under a single
+	 * snapshot or several refreshed ones. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e92e108b6b6..a26e0832e38 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6778,6 +6779,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6833,6 +6835,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6890,6 +6897,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 09b9b394e0e..ec8928ad90b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a fresh snapshot is pushed as active.
+	 *
+	 * The snapshot is not popped at the end of the scan.
+	 * The goal of this mode is to keep the xmin horizon advancing.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1775,6 +1797,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-parallel concurrent build of a non-unique index,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots are then changed
+ * on the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..2225cd0bf87 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 989b4db226b..fb131270668 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v11-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.0K, 10-v11-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From 4dfec74ca135e0371156d68cd303e8302a328e64 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v11 05/11] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
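
As a rough illustration of point 2 (not the patch's code), the check can be
thought of as a single pass over key-sorted entries that remembers whether an
alive entry was already seen in the current key group. The sketch below is a
simplified, standalone model: SortedEntry and find_alive_duplicate are
hypothetical names, liveness is a precomputed flag rather than a heap probe,
and NULL handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple coming out of the sort. */
typedef struct SortedEntry
{
	int		key;	/* index key value */
	bool	alive;	/* assumed result of a liveness probe of the heap tuple */
} SortedEntry;

/*
 * Scan entries sorted by key. Return the position of the second alive entry
 * in the first key group that holds more than one alive entry, or -1 if no
 * such group exists.
 */
static int
find_alive_duplicate(const SortedEntry *entries, size_t n)
{
	bool	seen_alive = false;

	for (size_t i = 0; i < n; i++)
	{
		/* The input must already be sorted by key. */
		if (i > 0)
			assert(entries[i - 1].key <= entries[i].key);

		/* A new key group starts: forget what was seen in the previous one. */
		if (i == 0 || entries[i].key != entries[i - 1].key)
			seen_alive = false;

		if (entries[i].alive)
		{
			if (seen_alive)
				return (int) i;	/* two alive entries share one key */
			seen_alive = true;
		}
	}
	return -1;
}
```

For an "addda" group (alive, three dead, alive, all with one key) the function
reports the last entry, which is the case a check limited to adjacent pairs of
tuples could miss.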
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 263 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3b3cbe571ac..bc3d3738ede 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 53363ee695a..f8976de6784 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 810f80fc8e6..8b236c8ccd6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples in uniqueness checks during a concurrent
+	 * build. This is required because of the periodic snapshot resets.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool is free of
+	 * multiple alive tuples with the same values. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not checked yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 00e17a1f0f9..647f8e7b3af 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4684,7 +4682,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4802,17 +4800,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls (if not NULL) is set to true when either tuple has a null column.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4838,6 +4843,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4857,7 +4864,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4868,7 +4875,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4877,6 +4885,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls (if not NULL) is set to true when either tuple has a null column.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4885,7 +4895,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4902,6 +4913,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 707ff39ef40..d937ba65c9c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3293,9 +3293,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index c8e7880f954..5921dcf68a1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..0b25926bc56 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -30,6 +30,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -123,6 +124,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +351,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +394,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1524,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1534,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the one in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b88bd443554..e756ad9b5b0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9a9b094f3f1..d69baaa364f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1799,9 +1799,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.  The
+ * snapshot is then changed on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..76131b6f2e1 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/octet-stream] v11-0004-Allow-snapshot-resets-during-parallel-concurrent.patch (34.1K, 11-v11-0004-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From 0bf49a073a1bd89460374dd16dfe25c49a81ddaf Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v11 04/11] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 49 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 196 insertions(+), 67 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c21608a6fd8..e580483a7cb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1244,7 +1243,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1259,6 +1257,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2359,7 +2358,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2390,25 +2388,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2448,8 +2446,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2474,7 +2470,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2520,7 +2517,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2536,6 +2532,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2544,7 +2547,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2567,9 +2571,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2769,14 +2770,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2798,6 +2799,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2938,6 +2940,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 580ec7f9aa8..3b3cbe571ac 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that; for a non-unique index that
+	 * snapshot is reset periodically during the scan.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b490da0eeee..810f80fc8e6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that; for a non-unique index, that snapshot may be
+	 * reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e18a8f8250f..b5b7be60a5e 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan with
+	 * SO_RESET_SNAPSHOT and start from the current active snapshot.
+	 * Otherwise restore the serialized snapshot or use SnapshotAny.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 7817bedc2ef..e9c0a46fd78 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report whether
+		 * each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Reset the per-worker snapshot-restored flags. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, a worker counts as attached only after it has
+ * restored its snapshot. This is needed with periodic snapshot resets, so that
+ * all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1495,6 +1533,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8c6dfecf515..707ff39ef40 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index fa2d522b25f..ef4d0ae2fab 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3d018c3a1e8..4cd536e988c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 8811618acb7..f5cae39c85f 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index dc6e0184284..8529b808aed 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ec8928ad90b..9a9b094f3f1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1180,7 +1180,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1798,9 +1799,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. This changes snapshots on the fly, allowing the xmin horizon
+ * to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
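
Editor's sketch (not part of the patch): the diff above does two things for parallel builds. It picks one of three snapshot modes in `_bt_begin_parallel`, and it makes the leader treat a worker as attached only once that worker has flagged its snapshot as restored. The standalone C model below illustrates both decisions under simplifying assumptions. Every name in it (`ScanSnapshotMode`, `WorkerSlot`, `choose_scan_snapshot_mode`, `poll_attached_workers`) is hypothetical and exists only for illustration; the real code works on `ParallelContext`, shared memory queues and background worker handles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT PostgreSQL structures. */
typedef enum ScanSnapshotMode
{
	SCAN_SNAPSHOT_ANY,		/* non-concurrent build: SnapshotAny */
	SCAN_SNAPSHOT_RESET,	/* concurrent, non-unique: InvalidSnapshot + reset */
	SCAN_SNAPSHOT_MVCC		/* concurrent, unique: one registered MVCC snapshot */
} ScanSnapshotMode;

/*
 * Models the three-way branch in _bt_begin_parallel:
 * reset_snapshot = isconcurrent && !isunique.
 */
static ScanSnapshotMode
choose_scan_snapshot_mode(bool isconcurrent, bool isunique)
{
	if (!isconcurrent)
		return SCAN_SNAPSHOT_ANY;
	return isunique ? SCAN_SNAPSHOT_MVCC : SCAN_SNAPSHOT_RESET;
}

typedef struct WorkerSlot
{
	bool		sender_attached;	/* models shm_mq_get_sender(mq) != NULL */
	bool		snapshot_restored;	/* models the flag the worker sets */
	bool		known_attached;		/* leader's bookkeeping */
} WorkerSlot;

/*
 * One polling pass of the leader. A worker becomes "known attached" when its
 * queue has a sender and, if wait_for_snapshot is set, it has also reported
 * that its snapshot is restored. Returns the number of known-attached workers.
 */
static int
poll_attached_workers(WorkerSlot *slots, int nworkers, bool wait_for_snapshot)
{
	int			known = 0;

	assert(slots != NULL || nworkers == 0);

	for (int i = 0; i < nworkers; i++)
	{
		if (!slots[i].known_attached &&
			slots[i].sender_attached &&
			(!wait_for_snapshot || slots[i].snapshot_restored))
			slots[i].known_attached = true;

		if (slots[i].known_attached)
			known++;
	}
	return known;
}
```

In this model the leader would keep polling until the returned count equals the number of launched workers. The real function also handles workers that fail to start, which the sketch leaves out.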



  [application/octet-stream] v11-0002-Add-stress-tests-for-concurrent-index-operations.patch (8.0K, 12-v11-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 6659bd291b5412de62ecdae76d8cac30f0f8487b Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v11 02/11] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 189 ++++++++++++++++++++++++++++++++
 2 files changed, 190 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..a9559dbe3af
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,189 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN/GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 4)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIN (ia);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIST (p);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING BRIN (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING HASH (updated_at);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v11-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (17.5K, 13-v11-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From e4e33536ec7137caedd31eea050589c8398cb800 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v11 01/11] This is https://commitfest.postgresql.org/50/5160/
 merged into a single commit. It is required for stability of the stress tests.

---
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 6 files changed, 216 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d6e23caef17..0ff498c4e14 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 7c87f012c30..ae11c1dd463 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7e71d422a62..3922ae39681 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Not supported for exclusion constraints. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for any additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1af8c9caf6c..8a1a085b106 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index b9759c31252..f91203dd353 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open the indexes in a loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the
+		 * same set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8f1508b1ee2..3d018c3a1e8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-01-18 14:18  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2025-01-18 14:18 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, everyone!

This is an updated version. It contains some optimizations to the STIR index
access method, related to the fact that it is never used with WAL.

>  The locking in stirinsert can probably be improved significantly if
>  we use things like atomic operations on STIR pages. We'd need an
>  exclusive lock only for page initialization, while share locks are
>  enough if the page's data is modified without WAL. That should improve
>  concurrent insert performance significantly, as it would further
>  reduce the length of the exclusively locked hot path.

Matthias, you proposed using only shared locks for writes, but how is that
possible if the page must be marked as dirty, which requires an exclusive
lock?

> It occurs because the first phase is slow, and many tuples need to be
> inserted during the validation phase.
> For each tuple, heapam_index_fetch_tuple is called, even for those on the
> same page.
> It might be possible to implement a batched version of
> heapam_index_fetch_tuple to handle multiple tuples on the same page and
> mitigate this issue.

That assumption was wrong. It looks like the difference is due to
prefetching. I'll try to add prefetching to the validation phase.

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v12-0005-Allow-snapshot-resets-during-parallel-concurrent.patch (34.1K, 3-v12-0005-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From 0bf49a073a1bd89460374dd16dfe25c49a81ddaf Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v12 05/12] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
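
The effect on the xmin horizon described above can be illustrated with a toy
model. This is only a sketch and not code from the patch: ToyCluster,
ToyScan, toy_take_snapshot and toy_scan_final_xmin are hypothetical names,
and the model assumes one new transaction id is consumed per scanned page and
that no other long-running transaction holds the horizon back.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Hypothetical stand-in for the cluster: only tracks the next xid to assign. */
typedef struct ToyCluster
{
	Xid			next_xid;
} ToyCluster;

/* Hypothetical stand-in for the index build's heap scan. */
typedef struct ToyScan
{
	Xid			xmin;				/* xmin advertised by the current snapshot */
	int			pages_since_reset;	/* pages scanned since the snapshot was taken */
} ToyScan;

/* Take a fresh snapshot: in this toy model its xmin is simply next_xid. */
static void
toy_take_snapshot(ToyScan *scan, const ToyCluster *cluster)
{
	scan->xmin = cluster->next_xid;
	scan->pages_since_reset = 0;
}

/*
 * Scan npages pages while concurrent writers consume one xid per page.
 * With reset_every == 0 a single snapshot is held for the whole scan;
 * otherwise the snapshot is replaced after every reset_every pages.
 * Returns the xmin the scan still advertises when it finishes.
 */
static Xid
toy_scan_final_xmin(ToyCluster *cluster, int npages, int reset_every)
{
	ToyScan		scan;

	assert(npages >= 0 && reset_every >= 0);
	toy_take_snapshot(&scan, cluster);

	for (int page = 0; page < npages; page++)
	{
		cluster->next_xid++;		/* concurrent activity during the scan */

		if (reset_every > 0 && ++scan.pages_since_reset >= reset_every)
			toy_take_snapshot(&scan, cluster);
	}
	return scan.xmin;
}
```

In this model, starting from next_xid = 100 and scanning 1000 pages, a scan
with reset_every = 0 still advertises xmin 100 at the end, while a scan with
reset_every = 64 advertises xmin 1060, so far fewer dead tuples would be
protected from cleanup.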
---
 src/backend/access/brin/brin.c                | 49 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 196 insertions(+), 67 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c21608a6fd8..e580483a7cb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1244,7 +1243,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1259,6 +1257,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2359,7 +2358,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2390,25 +2388,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2448,8 +2446,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2474,7 +2470,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2520,7 +2517,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2536,6 +2532,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2544,7 +2547,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2567,9 +2571,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2769,14 +2770,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2798,6 +2799,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2938,6 +2940,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 580ec7f9aa8..3b3cbe571ac 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b490da0eeee..810f80fc8e6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e18a8f8250f..b5b7be60a5e 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that
+	 * use periodic snapshot resets, mark the scan accordingly and use the active
+	 * snapshot as the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 7817bedc2ef..e9c0a46fd78 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for per-worker flags that report whether
+		 * the snapshot has been imported.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i * sizeof(bool);
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1495,6 +1533,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8c6dfecf515..707ff39ef40 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index fa2d522b25f..ef4d0ae2fab 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3d018c3a1e8..4cd536e988c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 8811618acb7..f5cae39c85f 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index dc6e0184284..8529b808aed 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ec8928ad90b..9a9b094f3f1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1180,7 +1180,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1798,9 +1799,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. That changes snapshots on the fly, allowing the xmin horizon
+ * to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
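
The leader/worker handshake that the patch above adds to WaitForParallelWorkersToAttach() can be illustrated in isolation. The following is only a rough sketch, not code from the patch: `WorkerSlot` and `poll_workers` are hypothetical names, and the shared-memory flags and error queues are modeled as plain struct fields. It shows the one rule the diff introduces: with `wait_for_snapshot` set, a worker is counted as attached only after it has also reported that its snapshot is restored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-worker state the leader polls. */
typedef struct WorkerSlot
{
	bool		queue_attached;		/* worker attached to its error queue */
	bool		snapshot_restored;	/* worker restored its initial snapshot */
	bool		known_attached;		/* leader has already counted this worker */
} WorkerSlot;

/*
 * One polling pass over all workers, modeled on the loop body changed in the
 * patch.  Returns how many workers were newly counted as attached.
 */
int
poll_workers(WorkerSlot *slots, int nworkers, bool wait_for_snapshot)
{
	int			newly_attached = 0;
	int			i;

	assert(nworkers >= 0);

	for (i = 0; i < nworkers; ++i)
	{
		if (slots[i].known_attached || !slots[i].queue_attached)
			continue;

		/* With wait_for_snapshot, attaching to the queue is not enough. */
		if (!wait_for_snapshot || slots[i].snapshot_restored)
		{
			slots[i].known_attached = true;
			++newly_attached;
		}
	}

	return newly_attached;
}
```

A caller would repeat `poll_workers` until every launched worker is counted. The real function additionally handles workers that stop or fail to start, which this sketch leaves out.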



  [application/octet-stream] v12-0004-Allow-advancing-xmin-during-non-unique-non-paral.patch (43.2K, 4-v12-0004-Allow-advancing-xmin-during-non-unique-non-paral.patch)
  download | inline diff:
From 67b2be8dc40832eac7fc3004803e93c8be49f8ea Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v12 04/12] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  16 +++
 src/backend/access/gin/gininsert.c            |   3 +
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 19 files changed, 406 insertions(+), 34 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 7f7b55d902a..a026fbc692a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9a984547578..c21608a6fd8 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1224,6 +1224,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1243,6 +1244,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2366,6 +2368,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2394,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2446,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2527,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2545,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8e1788dbcf7..97ef10c0098 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -21,6 +21,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -375,6 +376,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	/*
 	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
 	 * prefers to receive tuples in TID order.
@@ -423,6 +425,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	return result;
 }
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f950b9925f5..901aa667aa0 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -191,6 +191,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 485525f4d64..fd3ceb754b0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -568,6 +569,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -609,7 +640,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1236,6 +1273,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e817f8f8f84..580ec7f9aa8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, and that infrastructure is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* store link to snapshot because it may be copied */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 07bae342e25..0d262a4188d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db90..b490da0eeee 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 221fbb4e286..8c6dfecf515 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3206,7 +3223,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3269,12 +3287,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0ff498c4e14..c8e7880f954 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using single or
+	 * multiple refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e92e108b6b6..a26e0832e38 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6778,6 +6779,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6833,6 +6835,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6890,6 +6897,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 09b9b394e0e..ec8928ad90b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped,
+	 * the catalog snapshot is invalidated, and the latest snapshot becomes active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon advancing.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1775,6 +1797,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots are then changed
+ * on the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..2225cd0bf87 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 989b4db226b..fb131270668 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v12-0003-Add-stress-tests-for-concurrent-index-operations.patch (8.0K, 5-v12-0003-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 6659bd291b5412de62ecdae76d8cac30f0f8487b Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v12 03/12] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 189 ++++++++++++++++++++++++++++++++
 2 files changed, 190 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..a9559dbe3af
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,189 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN/GIST/BRIN/HASH/SPGIST indexes concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 4)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIN (ia);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIST (p);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING BRIN (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING HASH (updated_at);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0
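
A note on the gating used by the pgbench scripts above: the client holding advisory lock 42 only runs the CREATE/REINDEX/DROP cycle when `nextval('in_row_rebuild')` returns a value below 3, and every upsert client resets the sequence with `setval('in_row_rebuild', 1)`. The effect appears to be that a rebuild runs only if at least one writer finished since the previous check. Below is a minimal single-threaded sketch of that counter logic in C. The `Sketch*` names are hypothetical and not part of the patch, and real sequences also track an is_called flag and concurrent access, which this model ignores.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of the in_row_rebuild sequence. */
typedef struct SketchSeq
{
	long		last_value;		/* last value handed out or set */
} SketchSeq;

/* Model of nextval(): advance by the increment (1) and return the new value. */
static long
sketch_nextval(SketchSeq *seq)
{
	seq->last_value += 1;
	return seq->last_value;
}

/* Model of setval(seq, value): the next nextval() returns value + 1. */
static void
sketch_setval(SketchSeq *seq, long value)
{
	seq->last_value = value;
}

/* Mirrors "\if :last_value < 3" in the lock-holder branch of the script. */
static bool
sketch_should_rebuild(SketchSeq *seq)
{
	return sketch_nextval(seq) < 3;
}
```

With this model, the set-up call to nextval yields 1, the first lock holder sees 2 and rebuilds, and a second check without an intervening writer sees 3 and skips the rebuild until some upsert client calls setval again.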



  [application/octet-stream] v12-0002-This-is-https-commitfest.postgresql.org-50-5160-.patch (17.5K, 6-v12-0002-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From e4e33536ec7137caedd31eea050589c8398cb800 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v12 02/12] This is https://commitfest.postgresql.org/50/5160/
 merged into a single commit. It is required for stability of the stress tests.

---
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 6 files changed, 216 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d6e23caef17..0ff498c4e14 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 7c87f012c30..ae11c1dd463 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7e71d422a62..3922ae39681 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1af8c9caf6c..8a1a085b106 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index b9759c31252..f91203dd353 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * When conventional inference is involved, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8f1508b1ee2..3d018c3a1e8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0
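
To make the arbiter selection in the patch above easier to follow, here is a rough standalone C sketch of the two checks it adds to ExecInitPartitionInfo: the pairwise compatibility test performed by IsIndexCompatibleAsArbiter, and the adjusted list-length sanity check. The `SketchIndex` struct and the `sketch_*` functions are hypothetical stand-ins, not PostgreSQL types. Predicates are modeled as plain strings compared for equality, which is stricter than the one-directional `list_difference` calls in the patch, and index expressions are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_KEYS 8

/* Hypothetical, simplified stand-in for the index metadata the patch reads. */
typedef struct SketchIndex
{
	bool		is_unique;		/* mirrors ii_Unique */
	bool		has_exclusion;	/* true when ii_ExclusionOps != NULL */
	int			nkeyatts;		/* mirrors rd_index->indnkeyatts */
	int			keyattrs[SKETCH_MAX_KEYS];	/* mirrors indkey.values[] */
	const char *predicate;		/* NULL for a non-partial index */
} SketchIndex;

/* Returns true if candidate could act as an arbiter alongside arbiter. */
static bool
sketch_compatible_as_arbiter(const SketchIndex *arbiter,
							 const SketchIndex *candidate)
{
	if (arbiter->is_unique != candidate->is_unique)
		return false;
	/* Exclusion constraints are rejected on either side. */
	if (arbiter->has_exclusion || candidate->has_exclusion)
		return false;
	if (arbiter->nkeyatts != candidate->nkeyatts)
		return false;

	/* Key columns must match position by position. */
	for (int i = 0; i < candidate->nkeyatts; i++)
	{
		if (arbiter->keyattrs[i] != candidate->keyattrs[i])
			return false;
	}

	/* Simplification: predicates must be both absent or textually equal. */
	if (arbiter->predicate == NULL || candidate->predicate == NULL)
		return arbiter->predicate == candidate->predicate;
	return strcmp(arbiter->predicate, candidate->predicate) == 0;
}

/*
 * Mirrors the adjusted sanity check: the leaf arbiter list may be longer
 * than the root list only by the number of additional arbiters found.
 */
static bool
sketch_arbiter_count_ok(int root_count, int leaf_count, int additional)
{
	return root_count == leaf_count - additional;
}
```

For example, during REINDEX CONCURRENTLY of a unique index on column 1, the old index and its not-yet-swapped copy would compare as compatible, and the length check would pass with one root arbiter, two leaf arbiters, and one additional arbiter.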



  [application/octet-stream] v12-0001-ExecInitAgg-update-aggstate-numaggs-and-numtrans.patch (1.7K, 7-v12-0001-ExecInitAgg-update-aggstate-numaggs-and-numtrans.patch)
  download | inline diff:
From 3f482940dbcbd15834a67894f4d9efdf5ceb7e16 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Tue, 7 Jan 2025 15:13:50 -0800
Subject: [PATCH v12 01/12] ExecInitAgg: update aggstate->numaggs and
 ->numtrans earlier.

Functions hash_agg_entry_size() and build_hash_tables() make use of
those values for memory size estimates.

Because this change only affects memory estimates, don't backpatch.

Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com
---
 src/backend/executor/nodeAgg.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)
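
The reasoning in the commit message can be illustrated with a minimal sketch. This is not the executor code: `AggStateSketch`, `sketch_set_counts`, `sketch_entry_size` and the per-state byte count are hypothetical names and values chosen only for illustration. It shows that an estimator reading the counters before they are assigned sees zero, while the same call after assignment sees the real count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters kept in the aggregate state. */
typedef struct AggStateSketch
{
	int			numaggs;
	int			numtrans;
} AggStateSketch;

/*
 * Hypothetical estimator: the per-group entry size grows with the number of
 * transition states.  The real formula is not reproduced here.
 */
static size_t
sketch_entry_size(const AggStateSketch *st, size_t per_trans_bytes)
{
	assert(st != NULL);
	return (size_t) st->numtrans * per_trans_bytes;
}

/* Assign both counters from the highest numbers seen, as the patch does. */
static void
sketch_set_counts(AggStateSketch *st, int max_aggno, int max_transno)
{
	st->numaggs = max_aggno + 1;
	st->numtrans = max_transno + 1;
}
```

Calling `sketch_entry_size()` before `sketch_set_counts()` yields an estimate of zero, which is the situation the patch avoids by assigning the counters earlier.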

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 66cd4616963..3005b5c0e3b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -3379,8 +3379,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 		max_aggno = Max(max_aggno, aggref->aggno);
 		max_transno = Max(max_transno, aggref->aggtransno);
 	}
-	numaggs = max_aggno + 1;
-	numtrans = max_transno + 1;
+	aggstate->numaggs = numaggs = max_aggno + 1;
+	aggstate->numtrans = numtrans = max_transno + 1;
 
 	/*
 	 * For each phase, prepare grouping set data and fmgr lookup data for
@@ -3943,13 +3943,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 		ReleaseSysCache(aggTuple);
 	}
 
-	/*
-	 * Update aggstate->numaggs to be the number of unique aggregates found.
-	 * Also set numstates to the number of unique transition states found.
-	 */
-	aggstate->numaggs = numaggs;
-	aggstate->numtrans = numtrans;
-
 	/*
 	 * Last, check whether any more aggregates got added onto the node while
 	 * we processed the expressions for the aggregate arguments (including not
-- 
2.43.0



  [application/octet-stream] v12-0007-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.0K, 8-v12-0007-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
From 538115eb152d72d10c2cbe9a62a40ddec22236af Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v12 07/12] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h
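
For readers who want the storage idea without the buffer manager, here is a simplified, single-process sketch of how an append-only TID store of this kind can behave. It is an assumption-laden model, not the patch's code: all `Sk*`/`sk_*` names, the 24-byte header and 8-byte opaque sizes, and the fixed page array are hypothetical. It shows fixed-size TIDs appended to the last data page, a new page started when the last one is full, and a skip flag that turns inserts into no-ops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes only; not taken from PostgreSQL headers. */
enum
{
	SK_PAGE_SIZE = 8192,
	SK_HEADER_SIZE = 24,
	SK_OPAQUE_SIZE = 8,
	SK_MAX_PAGES = 4
};

/* A 6-byte heap pointer: block number split in two halves plus an offset. */
typedef struct SkTid
{
	uint16_t	blk_hi;
	uint16_t	blk_lo;
	uint16_t	offset;
} SkTid;

typedef struct SkPage
{
	uint16_t	maxoff;			/* number of TIDs stored on the page */
	unsigned char data[SK_PAGE_SIZE - SK_HEADER_SIZE - SK_OPAQUE_SIZE];
} SkPage;

typedef struct SkIndex
{
	bool		skip_inserts;	/* once set, inserts are silently skipped */
	uint32_t	last_blk;		/* 0 = no data page yet (block 0 is the metapage) */
	SkPage		pages[SK_MAX_PAGES];	/* pages[0] stands in for the metapage */
} SkIndex;

static size_t
sk_page_capacity(void)
{
	return sizeof(((SkPage *) 0)->data) / sizeof(SkTid);
}

/* Append one TID at the end of the page; false if it does not fit. */
static bool
sk_page_add(SkPage *page, const SkTid *tid)
{
	assert(page != NULL && tid != NULL);
	if (page->maxoff >= sk_page_capacity())
		return false;
	memcpy(page->data + (size_t) page->maxoff * sizeof(SkTid), tid, sizeof(SkTid));
	page->maxoff++;
	return true;
}

static void
sk_index_init(SkIndex *idx)
{
	memset(idx, 0, sizeof(*idx));
}

/* Returns true if the TID was stored, false if skipped or out of room. */
static bool
sk_insert(SkIndex *idx, SkTid tid)
{
	if (idx->skip_inserts)
		return false;
	/* Try the current last data page first. */
	if (idx->last_blk > 0 && sk_page_add(&idx->pages[idx->last_blk], &tid))
		return true;
	/* Last page is full or absent: start a new one and record it. */
	if (idx->last_blk + 1 >= SK_MAX_PAGES)
		return false;
	idx->last_blk++;
	return sk_page_add(&idx->pages[idx->last_blk], &tid);
}
```

The sketch leaves out everything that makes the real access method concurrent: buffer locks, the re-check of the last block number after taking the metapage lock, and critical sections.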

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 09fab08b8e1..aaf55d689d2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..b844bcb21d7
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage =  BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes are not supported to be built")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d937ba65c9c..2dbf8f82141 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3403,6 +3403,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2a7769b1fd1..f27d9041e2c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0d92e694d6a..a39d36c3539 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 6b66bc18286..694a2518ba5 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1be8739573f..44f8a0d5606 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 43445cdcc6c..26ddd5ec577 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b37e8a6f882..5ea2b12bf0a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b3f7aa299f5..7bfe0acb91c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			# index-helper for concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v12-0006-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.0K, 9-v12-0006-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From 4dfec74ca135e0371156d68cd303e8302a328e64 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v12 06/12] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 263 insertions(+), 93 deletions(-)
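
Note for reviewers (not part of the commit message): the second check above,
done in _bt_load, can be pictured with the standalone sketch below. It is an
illustration only, not code from this patch. SketchTuple and
find_alive_duplicate are hypothetical names, and the "alive" flag is assumed
to be known up front, whereas the patch fetches heap tuples with SnapshotSelf
lazily and only for groups of equal keys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the sorted spool. */
typedef struct SketchTuple
{
	int			key;		/* single-column key, for simplicity */
	bool		is_null;	/* key is NULL */
	bool		alive;		/* assumed result of a heap visibility check */
} SketchTuple;

/*
 * Scan tuples sorted by key and return the position of the second alive
 * tuple found inside a group of equal keys, or -1 if no group holds more
 * than one alive tuple.  Dead tuples inside a group are skipped, so a run
 * such as "alive, dead, dead, dead, alive" is still reported.
 */
int
find_alive_duplicate(const SketchTuple *tuples, size_t ntuples,
					 bool nulls_not_distinct)
{
	bool		seen_alive = false;

	assert(tuples != NULL || ntuples == 0);

	for (size_t i = 0; i < ntuples; i++)
	{
		bool		equal_to_prev = false;

		if (i > 0)
		{
			const SketchTuple *prev = &tuples[i - 1];
			const SketchTuple *cur = &tuples[i];

			if (prev->is_null || cur->is_null)
			{
				/* NULLs conflict only under NULLS NOT DISTINCT */
				equal_to_prev = nulls_not_distinct &&
					prev->is_null && cur->is_null;
			}
			else
				equal_to_prev = (prev->key == cur->key);
		}

		/* a new key starts a new group */
		if (!equal_to_prev)
			seen_alive = false;

		if (tuples[i].alive)
		{
			if (seen_alive)
				return (int) i;	/* second alive tuple in this group */
			seen_alive = true;
		}
	}

	return -1;
}
```

In the patch itself the equivalent condition raises ERRCODE_UNIQUE_VIOLATION
instead of returning a position.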

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3b3cbe571ac..bc3d3738ede 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 53363ee695a..f8976de6784 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 810f80fc8e6..8b236c8ccd6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is unique and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool to keep them out of the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * In a concurrent build we must ignore dead tuples during uniqueness checks.
+	 * This is required because the snapshot is reset periodically.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool is free of multiple
+	 * alive tuples with the same key values. This can happen when alive tuples
+	 * are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet tested */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 00e17a1f0f9..647f8e7b3af 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4684,7 +4682,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4802,17 +4800,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so that it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is non-NULL, it is set to true when a compared attribute is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4838,6 +4843,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4857,7 +4864,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4868,7 +4875,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent unique index build.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4877,6 +4885,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is non-NULL, it is set to true when a compared attribute is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4885,7 +4895,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4902,6 +4913,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 707ff39ef40..d937ba65c9c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3293,9 +3293,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often.  The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index c8e7880f954..5921dcf68a1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..0b25926bc56 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -30,6 +30,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -123,6 +124,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +351,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +394,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1524,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1534,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b88bd443554..e756ad9b5b0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9a9b094f3f1..d69baaa364f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1799,9 +1799,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan, so
+ * snapshots change on the fly, allowing the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..76131b6f2e1 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/octet-stream] v12-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (102.0K, 10-v12-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
  download | inline diff:
From d27077f0412566da22670bff3790ce0af65ae4fe Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v12 08/12] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
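
As a rough illustration of the merge step described above (this is not code
from the patch): assuming both indexes yield their heap TIDs in sorted order,
the tuples that still need to be inserted into the main index are those
present in the auxiliary index but absent from the main one. The Tid struct
and the tids_missing_from_main function are hypothetical names used only for
this sketch; the patch itself works with tuplesorts and ItemPointerData.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID: (block number, offset number). */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Order TIDs by block, then by offset; returns <0, 0 or >0. */
static int
tid_compare(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Both inputs must be sorted in ascending TID order.  Copies into out[]
 * every distinct TID that appears in aux[] but not in main_tids[]; out[]
 * must have room for naux entries.  Returns the number of TIDs written.
 */
static size_t
tids_missing_from_main(const Tid *main_tids, size_t nmain,
					   const Tid *aux, size_t naux,
					   Tid *out)
{
	size_t		i = 0;
	size_t		nout = 0;

	assert(naux == 0 || out != NULL);

	for (size_t j = 0; j < naux; j++)
	{
		/* Advance the main cursor past TIDs smaller than the current one. */
		while (i < nmain && tid_compare(&main_tids[i], &aux[j]) < 0)
			i++;

		/* Already present in the main index: nothing to do. */
		if (i < nmain && tid_compare(&main_tids[i], &aux[j]) == 0)
			continue;

		/* Skip a repeat of the TID we emitted last. */
		if (nout > 0 && tid_compare(&out[nout - 1], &aux[j]) == 0)
			continue;

		out[nout++] = aux[j];
	}
	return nout;
}
```

Because both inputs are consumed in a single forward pass, the cost is linear
in the number of index entries rather than in the size of the heap, which is
the intuition behind dropping the second table scan.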
---
 doc/src/sgml/monitoring.sgml                  |  26 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/heapam.c              |   2 +-
 src/backend/access/heap/heapam_handler.c      | 368 ++++++++---------
 src/backend/catalog/index.c                   | 308 ++++++++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 ++++++++++++++----
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  28 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 ++
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 21 files changed, 968 insertions(+), 384 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d0d176cc54f..cf7a3bf5271 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6202,6 +6202,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, so that all future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6242,13 +6254,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6265,8 +6276,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> also waits here before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..6a05620bd67 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,11 +368,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fd3ceb754b0..f96845b11d0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -642,7 +642,7 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 	if (BufferIsValid(scan->rs_cbuf))
 	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
-#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 1024
 		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
 			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
 			heap_reset_scan_snapshot((TableScanDesc) scan);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bc3d3738ede..d2fa463298b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,253 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	IndexFetchTableData *fetch;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
 
 	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded,
+					fetched;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
+	/*
+	 * Now take the "reference snapshot" that will be used to filter candidate
+	 * tuples.  Beware!  There might still be snapshots in
+	 * use that treat some transaction as in-progress that our reference
+	 * snapshot treats as committed.  If such a recently-committed transaction
+	 * deleted tuples in the table, we will not include them in the index; yet
+	 * those transactions which see the deleting one as still-in-progress will
+	 * expect such tuples to be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	fetch = heapam_index_fetch_begin(heapRelation);
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+	ItemPointerSetInvalid(&fetched);
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, (int64) state->itups);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
 	/*
-	 * Scan all tuples matching the snapshot.
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be merged with or compared to those from
+	 * the "main" sort (state->tuplesort).
 	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while (!auxtuplesort_empty)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
-
+		Datum		ts_val;
+		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
-
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
-		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
-		}
-
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(auxState->tuplesort, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
 		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
+		else
 		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
+			auxindexcursor = NULL;
 		}
 
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
 			{
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
 			}
-		}
-
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
 
 			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
 			 */
-			if (predicate != NULL)
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
 			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
+				bool call_again = false;
+				bool all_dead = false;
+				ItemPointer tid;
 
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
+				/* Copy the auxindexcursor TID into fetched. */
+				fetched = *auxindexcursor;
+				tid = &fetched;
 
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				state->htups += 1;
 
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
+				/*
+				 * Fetch the tuple from the heap to see if it's visible
+				 * under our snapshot. If it is, form the index key values
+				 * and insert a new entry into the target index.
+				 */
+				if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+				{
+
+					/* Compute the key values and null flags for this tuple. */
+					FormIndexDatum(indexInfo,
+								   slot,
+								   estate,
+								   values,
+								   isnull);
+
+					/*
+					 * Insert the tuple into the target index.
+					 */
+					index_insert(indexRelation,
+								 values,
+								 isnull,
+								 auxindexcursor, /* insert root tuple */
+								 heapRelation,
+								 indexInfo->ii_Unique ?
+								 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+								 false,
+								 indexInfo);
+
+					state->tups_inserted += 1;
+
+					elog(DEBUG5, "inserted tid: (%u,%u), root: (%u,%u)",
+											ItemPointerGetBlockNumber(tid),
+											ItemPointerGetOffsetNumber(tid),
+											ItemPointerGetBlockNumber(auxindexcursor),
+											ItemPointerGetOffsetNumber(auxindexcursor));
+				}
+				else
+				{
+					/*
+					 * The tuple wasn't visible under our snapshot. We
+					 * skip inserting it into the target index because
+					 * from our perspective, it doesn't exist.
+					 */
+					elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+						 ItemPointerGetBlockNumber(auxindexcursor),
+						 ItemPointerGetOffsetNumber(auxindexcursor));
+				}
+			}
 		}
 	}
 
-	table_endscan(scan);
+	/* We may exit early when the aux tuples run out, so make sure the progress view shows completion */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, (int64) state->itups);
 
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	heapam_index_fetch_end(fetch);
+
+	/*
+	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2dbf8f82141..0e06334f447 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should equal the persistence level of the table; auxiliary
+ *		indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +748,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +760,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +797,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1407,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1462,7 +1473,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1484,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into the catalogs and needs
+ * to be built later on.  This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2467,7 +2628,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2527,7 +2689,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3276,12 +3439,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index, based on the STIR AM, the same way.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that does not
+ * scan the table, so the result is an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * also inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3291,18 +3463,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3310,12 +3485,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" between
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3333,22 +3510,24 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, some actions to concurrently drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * the rest for the auxiliary index. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3381,13 +3560,18 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3405,15 +3589,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   maintenance_work_mem - (int) main_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3436,27 +3635,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
+	/* Done with tuplesort objects */
 	tuplesort_end(state.tuplesort);
+	tuplesort_end(auxState.tuplesort);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3465,8 +3670,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3525,6 +3734,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3796,6 +4010,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4038,6 +4259,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4063,6 +4285,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7a595c84db9..0e4d977db87 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1265,16 +1265,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5921dcf68a1..cd0d63ded82 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -554,6 +557,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -563,6 +567,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -584,10 +589,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -834,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -929,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1227,7 +1242,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1569,6 +1585,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1597,11 +1623,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new
+	 * indexes (main and auxiliary) will be marked not indisready and not
+	 * indisvalid, so that no one else tries to insert into or query them.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1611,7 +1637,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1650,7 +1676,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1688,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index.  It is created empty, without any
+	 * actual heap scan, but is marked as "ready-for-inserts".  Its purpose is
+	 * to accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that no transactions remain that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1698,43 +1748,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * The target index is now marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge the contents of the auxiliary and target indexes, inserting any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1757,12 +1795,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1787,6 +1825,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3542,6 +3627,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3647,8 +3733,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3700,8 +3793,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3762,6 +3862,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3865,15 +3972,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3924,6 +4034,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4052,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3951,6 +4071,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3969,10 +4090,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4178,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because it creates an empty index without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4102,24 +4269,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4134,13 +4329,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4152,16 +4340,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4361,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4271,14 +4451,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4303,6 +4483,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4316,11 +4518,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4542,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 694a2518ba5..4af3d3f7455 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -784,7 +784,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -800,6 +800,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -825,7 +826,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index d69baaa364f..d2060fce7cc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1862,22 +1862,22 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..01f85e57ea2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 18e3179ef63..4c3ea686494 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -92,14 +92,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7bfe0acb91c..8ab74e2b1d9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -177,8 +177,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 8011c141bf8..34331e4d48b 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3028,6 +3029,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3040,8 +3042,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3069,6 +3073,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fef..e0a46c0a42a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2013,14 +2013,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 068c66b95a5..b410fa5c541 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1244,10 +1245,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1259,6 +1262,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
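
The phase renumbering made by the patch above (progress.h and the pg_stat_progress_create_index view in rules.out) can be summarized with a small lookup sketch. This is only an illustration, not code from the patch: the function name `createidx_phase_name` is hypothetical, and the strings are copied from the updated view definition, with the index-AM subphase suffix of "building index" left out.

```c
#include <stddef.h>

/*
 * Phase names indexed by the new phase numbers (0..10), as listed in the
 * updated pg_stat_progress_create_index view.
 */
static const char *const createidx_phase_names[] = {
	"initializing",									/* 0 */
	"waiting for writers before build",				/* 1: WAIT_1 */
	"waiting for writers to use auxiliary index",	/* 2: WAIT_2 (new) */
	"building index",								/* 3: BUILD */
	"waiting for writers before validation",		/* 4: WAIT_3 */
	"index validation: scanning index",				/* 5: VALIDATE_IDXSCAN */
	"index validation: sorting tuples",				/* 6: VALIDATE_SORT */
	"index validation: merging indexes",			/* 7: VALIDATE_IDXMERGE */
	"waiting for old snapshots",					/* 8: WAIT_4 */
	"waiting for readers before marking dead",		/* 9: WAIT_5 */
	"waiting for readers before dropping"			/* 10: WAIT_6 (new) */
};

/* Hypothetical helper: return the phase name, or NULL if out of range. */
const char *
createidx_phase_name(int phase)
{
	int			nphases = (int) (sizeof(createidx_phase_names) /
								 sizeof(createidx_phase_names[0]));

	if (phase < 0 || phase >= nphases)
		return NULL;
	return createidx_phase_names[phase];
}
```

Returning NULL for unknown numbers matches the `ELSE NULL::text` branch of the view.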



  [application/octet-stream] v12-0009-Concurrently-built-index-validation-uses-fresh-s.patch (15.9K, 11-v12-0009-Concurrently-built-index-validation-uses-fresh-s.patch)
  download | inline diff:
From f59f5681b72d3d30d4ebb0aeb7d2a84fbf8a7f29 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:14:38 +0100
Subject: [PATCH v12 09/12] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach of using a single reference snapshot could lead to issues with xmin propagation. Specifically, if the index build took a long time, the reference snapshot's xmin could become outdated, causing the index to miss tuples that were deleted by transactions that committed after the reference snapshot was taken.

To address this, the validation process now periodically replaces the snapshot with a newer one. This ensures that the index's xmin is kept up-to-date and that all relevant tuples are included in the index.

The interval for replacing the snapshot is controlled by the `VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL` constant, which is currently set to 1000 milliseconds.
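
The refresh decision described above can be sketched in isolation as follows. This is a minimal standalone illustration, not the patch's code: `refresh_due` and `xid_newer` are hypothetical helpers, the elapsed time is passed in as a plain number of milliseconds instead of being measured, and the comparison is a simplified modulo-2^32 check that ignores special (non-normal) transaction IDs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as the constant named in the commit message (milliseconds). */
#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL 1000

/* Hypothetical helper: has the current snapshot been held long enough? */
bool
refresh_due(double elapsed_ms)
{
	return elapsed_ms >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL;
}

/*
 * Hypothetical helper: return whichever xid is newer, comparing modulo
 * 2^32 so that wraparound is tolerated. Simplified assumption: both
 * inputs are normal transaction IDs.
 */
uint32_t
xid_newer(uint32_t a, uint32_t b)
{
	return ((int32_t) (a - b) > 0) ? a : b;
}
```

In the validation loop, a check like `refresh_due` would decide when to drop the old snapshot and take a new one, and a helper like `xid_newer` would keep the tracked limit from moving backwards.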
---
 doc/src/sgml/ref/create_index.sgml       | 11 ++++--
 doc/src/sgml/ref/reindex.sgml            | 11 +++---
 src/backend/access/heap/README.HOT       | 15 +++++---
 src/backend/access/heap/heapam_handler.c | 45 ++++++++++++++++++------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++++--
 src/backend/catalog/index.c              | 19 +++++++---
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 ++++++++
 9 files changed, 100 insertions(+), 32 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6a05620bd67..64c633e0398 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -495,10 +495,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+A special auxiliary index is also created in the same way.  It is marked
+as "ready for inserts" without any actual table scan.  Its purpose is to
+collect new tuples inserted into the table while our target index is
+still "not ready for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d2fa463298b..e974f979b55 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1804,27 +1804,35 @@ heapam_index_validate_scan(Relation heapRelation,
 					fetched;
 	bool			tuplesort_empty = false,
 					auxtuplesort_empty = false;
+	instr_time		snapshotTime,
+					currentTime;
 
 	Assert(!HaveRegisteredOrActiveSnapshot());
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
+#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL	1000
 	/*
-	 * Now take the "reference snapshot" that will be used by to filter candidate
-	 * tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We are going to replace it with a newer snapshot every so
+	 * often to propagate the horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
 	 *
 	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
+	 * we mark the index as valid; for that reason limitXmin is tracked.
 	 *
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
+	INSTR_TIME_SET_CURRENT(snapshotTime);
 	limitXmin = snapshot->xmin;
 
 	/*
@@ -1865,6 +1873,23 @@ heapam_index_validate_scan(Relation heapRelation,
 		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
+		INSTR_TIME_SET_CURRENT(currentTime);
+		INSTR_TIME_SUBTRACT(currentTime, snapshotTime);
+		if (INSTR_TIME_GET_MILLISEC(currentTime) >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+			INSTR_TIME_SET_CURRENT(snapshotTime);
+		}
+
 		/*
 		* Attempt to fetch the next TID from the auxiliary sort. If it's
 		* empty, we set auxindexcursor to NULL.
@@ -2007,7 +2032,7 @@ heapam_index_validate_scan(Relation heapRelation,
 	heapam_index_fetch_end(fetch);
 
 	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * Drop the latest snapshot.  We must do this before waiting out other
 	 * snapshot holders, else we will deadlock against other processes also
 	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
 	 * they must wait for.
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8b236c8ccd6..62e975016ad 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..6a6b1f8797b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -190,14 +190,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -811,7 +813,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -925,6 +926,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -965,6 +970,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0e06334f447..8aa6b0a2830 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3477,8 +3477,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index.  In
+ * order to propagate xmin, we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3491,7 +3492,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3607,19 +3608,29 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
 											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
 
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/* Execute the sort */
 	{
 		const int	progress_index[] = {
@@ -3636,8 +3647,6 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	}
 	tuplesort_performsort(state.tuplesort);
 	tuplesort_performsort(auxState.tuplesort);
-
-	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cd0d63ded82..e10f6098f58 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4354,7 +4354,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 0cab8653f1b..3d8db998c0b 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0
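
Note on the limitXmin handling in the patch above: each time the validation scan swaps in a fresh snapshot, limitXmin is folded together with the new snapshot's xmin so that it never moves backwards. The following is only a minimal standalone sketch of that "keep the newer XID" step, assuming 32-bit transaction IDs compared modulo 2^32. The names xid_t, XID_INVALID, XID_FIRST_NORMAL, xid_follows and xid_newer are hypothetical stand-ins for illustration and are not the PostgreSQL definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and its special values. */
typedef uint32_t xid_t;
#define XID_INVALID       ((xid_t) 0)
#define XID_FIRST_NORMAL  ((xid_t) 3)

static bool
xid_is_normal(xid_t x)
{
    return x >= XID_FIRST_NORMAL;
}

/*
 * Is a logically later than b?  Normal XIDs are compared modulo 2^32, so
 * the signed difference decides; special XIDs use plain unsigned order.
 */
static bool
xid_follows(xid_t a, xid_t b)
{
    if (!xid_is_normal(a) || !xid_is_normal(b))
        return a > b;
    return (int32_t) (a - b) > 0;
}

/* Return the newer of two XIDs; an invalid XID always loses. */
static xid_t
xid_newer(xid_t a, xid_t b)
{
    if (a == XID_INVALID)
        return b;
    if (b == XID_INVALID)
        return a;
    return xid_follows(a, b) ? a : b;
}
```

In a loop that refreshes the snapshot, the accumulated limit would then be updated as limit = xid_newer(limit, new_snapshot_xmin), which keeps the largest horizon seen so far even across XID wraparound.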



  [application/octet-stream] v12-0010-Remove-PROC_IN_SAFE_IC-optimization.patch (20.6K, 12-v12-0010-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From ba5dc62cdc7e5fa48f38fc3ad524ace7edf1d450 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v12 10/12] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 8 files changed, 11 insertions(+), 233 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e580483a7cb..b4b36bda018 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2885,11 +2885,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 62e975016ad..1eb4299826e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e10f6098f58..b98851a9e35 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -116,7 +116,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -419,10 +418,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -443,8 +439,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -464,8 +459,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -579,7 +573,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1157,10 +1150,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1647,10 +1636,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1705,9 +1690,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1737,10 +1719,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1766,9 +1744,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1785,9 +1761,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1828,10 +1801,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1852,10 +1821,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3630,7 +3595,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4002,17 +3966,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4072,7 +4025,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4165,11 +4117,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4200,10 +4147,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4212,11 +4155,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4241,10 +4179,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4264,11 +4198,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4289,10 +4218,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4325,10 +4250,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4356,9 +4277,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4380,13 +4298,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4442,12 +4353,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4509,12 +4414,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4774,36 +4673,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 20777f7d5ae..4bd24bc02d4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 2225cd0bf87..b257a0344a8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc cic_reset_snapshots
+REGRESS = injection_points cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index fb131270668..051b3e789c1 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -34,7 +34,6 @@ tests += {
   'regress': {
     'sql': [
       'injection_points',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0
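
Note on the effect of dropping PROC_IN_SAFE_IC from the exclusion mask in WaitForOlderSnapshots(): backends that used to be skipped because of that flag are now counted among the snapshots to wait for. The sketch below is a simplified standalone model of that filter, not the real GetCurrentVirtualXIDs(). The backend_t struct, the FLAG_* names and count_snapshots_to_wait_for() are hypothetical, and XIDs are compared with plain <= (wraparound is ignored for brevity).

```c
#include <stdint.h>

/* Hypothetical flag bits mirroring the layout shown in proc.h above. */
#define FLAG_IS_AUTOVACUUM   0x01
#define FLAG_IN_VACUUM       0x02
#define FLAG_FORMER_SAFE_IC  0x04   /* the bit removed by this patch */

/* Hypothetical, much reduced view of one backend's advertised state. */
typedef struct
{
    uint32_t xmin;          /* 0 means "no xmin advertised" */
    uint8_t  status_flags;
} backend_t;

/*
 * Count backends whose snapshot might be older than limit_xmin, skipping
 * those that match exclude_mask or that advertise no xmin at all.
 */
static int
count_snapshots_to_wait_for(const backend_t *procs, int nprocs,
                            uint32_t limit_xmin, uint8_t exclude_mask)
{
    int count = 0;

    for (int i = 0; i < nprocs; i++)
    {
        if (procs[i].status_flags & exclude_mask)
            continue;       /* excluded class of backend */
        if (procs[i].xmin == 0)
            continue;       /* holds no snapshot */
        if (procs[i].xmin <= limit_xmin)
            count++;        /* may still see older data */
    }
    return count;
}
```

With the old mask (all three bits) a backend carrying the 0x04 bit is ignored; with the new mask (autovacuum and vacuum only) the same backend is waited for. That matches the commit message, which says the optimization is no longer needed once snapshots are refreshed periodically.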



  [application/octet-stream] v12-0012-Updates-index-insert-and-value-computation-logic.patch (2.2K, 13-v12-0012-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From 63fa0c329b4cbbcaefcb03ee7d7f67c19ffbdec3 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v12 12/12] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since they are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 49e83155972..eaf08f4f66a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2929,6 +2929,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ae11c1dd463..d070f80795d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -434,11 +434,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v12-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 14-v12-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From 3eef11ae2dbdc7c6df349c1c0b72089495e68fe4 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v12 11/12] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, because of an index
constraint violation), such indexes become junk indexes that serve no purpose.
This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 64c633e0398..c6db5d57167 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 096b68c7f39..1c2cfc94b54 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8aa6b0a2830..49e83155972 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -687,6 +687,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -733,6 +735,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -775,6 +778,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1176,6 +1181,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1458,6 +1472,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1608,6 +1623,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3829,6 +3845,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3885,6 +3902,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4173,7 +4203,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4262,13 +4293,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4294,18 +4342,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume any AUTO dependency on index with rel_kind
+		 * of RELKIND_INDEX is that we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b98851a9e35..ab6dbd32d9f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1224,7 +1224,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3593,6 +3593,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3941,6 +3942,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3948,6 +3950,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4010,12 +4013,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4025,6 +4033,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4045,10 +4054,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4205,7 +4222,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start process of dropping auxiliary indexes - including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4224,6 +4242,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure junk index is also marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4406,6 +4427,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4451,6 +4474,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4181c110eb7..e9b6ded6a55 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1492,6 +1492,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1552,9 +1554,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1606,6 +1619,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop requires to be the first transaction. But in
+		 * case we have junk auxiliary index - we want to drop it too (and also
+		 * in a concurrent way). In this case perform silent internal deletion
+		 * of auxiliary index, and restore transaction state. It is fine to do it
+		 * in the loop because there is only single element in drop->objects.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1634,12 +1675,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 01f85e57ea2..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 34331e4d48b..d858545dba3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3096,20 +3096,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index b410fa5c541..95e6f72fd4c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1273,11 +1273,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 33+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-01-30 01:00  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2025-01-30 01:00 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, everyone!

> It was a wrong assumption. It looks like it is happening because of
> prefetching. I'll try to add it in the validation phase.

This is an updated patch set; prefetching is now implemented.

Now validation works this way:
1) TIDs which are present in the STIR auxiliary index but not in the target
index are loaded into a tuplestore in sorted order
2) Then tuples from the tuplestore are fetched one by one, with underlying
prefetching of the corresponding pages
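
To make step 1 more concrete, here is a minimal standalone sketch of the
"present in auxiliary but not in target" computation as a merge over two
sorted TID arrays. This is not the code from the patch: `Tid`, `tid_cmp` and
`tid_difference` are hypothetical names for illustration only, `Tid` is a
simplified stand-in for ItemPointerData, and plain arrays replace the
tuplestore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for PostgreSQL's ItemPointerData. */
typedef struct Tid
{
	uint32_t	block;		/* heap block number */
	uint16_t	offset;		/* line pointer offset within the block */
} Tid;

/* Order TIDs by (block, offset), i.e. physical heap order. */
static int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Merge-style difference: copy into "out" every TID that is present in
 * "aux" but absent from "target".  Both inputs must be sorted and free of
 * duplicates; "out" must have room for naux entries.  Returns the number
 * of TIDs written; the output stays sorted.
 */
static size_t
tid_difference(const Tid *aux, size_t naux,
			   const Tid *target, size_t ntarget,
			   Tid *out)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		n = 0;

	assert(aux != NULL || naux == 0);
	assert(target != NULL || ntarget == 0);

	while (i < naux)
	{
		/* Once target is exhausted, everything left in aux is missing. */
		int			cmp = (j < ntarget) ? tid_cmp(&aux[i], &target[j]) : -1;

		if (cmp < 0)
			out[n++] = aux[i++];	/* only in aux: still has to be indexed */
		else if (cmp == 0)
		{
			i++;					/* present in both: nothing to do */
			j++;
		}
		else
			j++;					/* only in target: skip it */
	}
	return n;
}
```

Because the result is sorted by block number, consecutive entries refer to
the same or to later heap pages, which is the property that should make the
prefetching in step 2 straightforward.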

Benchmark setups are the same as in [0].
Results show it works really well (see attachments).
I was unable to achieve consistent results for a few tests on the AWS (io2)
environment (and it was costly :) )

So, my next plan is:
1) wait a little for comments from anyone still watching this
year-long, mostly solo thread :)
2) prepare a fresh new email with patches, explanation, benchmark results
and so on.

Best regards,
Mikhail.

[0]:
https://www.postgresql.org/message-id/flat/CANtu0ojHAputNCH73TEYN_RUtjLGYsEyW1aSXmsXyvwf%3D3U4qQ%40m...


Attachments:

  [application/octet-stream] v13-0001-Add-stress-tests-for-concurrent-index-operations.patch (8.0K, 3-v13-0001-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 6659bd291b5412de62ecdae76d8cac30f0f8487b Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v13 01/11] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 189 ++++++++++++++++++++++++++++++++
 2 files changed, 190 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..a9559dbe3af
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,189 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN/GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 4)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIN (ia);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIST (p);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING BRIN (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING HASH (updated_at);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v13-0005-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.0K, 4-v13-0005-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
  download | inline diff:
From 58c6c83c35bb44161c4500995ad413fce0938b3c Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v13 05/11] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 09fab08b8e1..aaf55d689d2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..b844bcb21d7
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage =  BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes are not supported to be built")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d937ba65c9c..2dbf8f82141 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3403,6 +3403,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2a7769b1fd1..f27d9041e2c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0d92e694d6a..a39d36c3539 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 6b66bc18286..694a2518ba5 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1be8739573f..44f8a0d5606 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 43445cdcc6c..26ddd5ec577 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any type for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b37e8a6f882..5ea2b12bf0a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b3f7aa299f5..7bfe0acb91c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
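
Reviewer's note on the page layout in stir.h above: a rough sketch of the arithmetic behind StirPageGetFreeSpace, to show how many bare TIDs fit on one STIR data page. Everything prefixed Sketch/SKETCH_ is hypothetical and exists only for this illustration. The 8192-byte block size, the 24-byte page header and the 8-byte maximum alignment are assumed defaults for a common 64-bit build and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed build constants (hypothetical names, default-build values). */
#define SKETCH_BLCKSZ			((size_t) 8192)	/* default block size */
#define SKETCH_PAGE_HEADER_SIZE	((size_t) 24)	/* assumed SizeOfPageHeaderData */
#define SKETCH_MAXIMUM_ALIGNOF	((size_t) 8)
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + SKETCH_MAXIMUM_ALIGNOF - 1) & ~(SKETCH_MAXIMUM_ALIGNOF - 1))

/* Stand-in for StirTuple: only a heap TID, i.e. three 16-bit fields. */
typedef struct SketchTuple
{
	uint16_t	blk_hi;
	uint16_t	blk_lo;
	uint16_t	offnum;
} SketchTuple;

/* Stand-in for StirPageOpaqueData: four 16-bit fields. */
typedef struct SketchOpaque
{
	uint16_t	maxoff;
	uint16_t	flags;
	uint16_t	unused;
	uint16_t	page_id;
} SketchOpaque;

/* Same formula as StirPageGetFreeSpace, with maxoff passed explicitly. */
static size_t
sketch_free_space(unsigned maxoff)
{
	size_t		used = SKETCH_MAXALIGN(SKETCH_PAGE_HEADER_SIZE)
		+ (size_t) maxoff * sizeof(SketchTuple)
		+ SKETCH_MAXALIGN(sizeof(SketchOpaque));

	assert(used <= SKETCH_BLCKSZ);
	return SKETCH_BLCKSZ - used;
}

/* Number of tuples that fit on an otherwise empty data page. */
static unsigned
sketch_tuples_per_page(void)
{
	return (unsigned) (sketch_free_space(0) / sizeof(SketchTuple));
}
```

Under those assumptions an empty page has 8160 usable bytes, so it can hold 1360 six-byte tuples. Other block sizes or alignments would change the numbers, not the formula.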



  [application/octet-stream] v13-0003-Allow-snapshot-resets-during-parallel-concurrent.patch (34.1K, 5-v13-0003-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From 4ed1282bea6a0515f9e91421da46d88688075305 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v13 03/11] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before proceeding with the scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 49 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 196 insertions(+), 67 deletions(-)
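
Reviewer's note: the commit message describes three snapshot modes for the parallel heap scan. The sketch below models only that decision and the two flags stored in the shared scan descriptor, as read from the nbtsort.c and tableam.c hunks in this patch. All Sketch/sketch_ names are hypothetical stand-ins rather than PostgreSQL APIs, and the model leaves out the real snapshot handling (push, pop, register, serialize).

```c
#include <assert.h>
#include <stdbool.h>

/* Which snapshot the leader hands to the parallel scan (hypothetical enum). */
typedef enum SketchScanSnapshot
{
	SKETCH_SNAPSHOT_ANY,		/* non-concurrent build: SnapshotAny */
	SKETCH_SNAPSHOT_RESETTING,	/* concurrent, non-unique: InvalidSnapshot
								 * plus the reset flag */
	SKETCH_SNAPSHOT_REGISTERED	/* concurrent, unique: one registered MVCC
								 * snapshot for the whole scan */
} SketchScanSnapshot;

/* The two flags this patch keeps in the shared scan descriptor. */
typedef struct SketchParallelScan
{
	bool		phs_snapshot_any;
	bool		phs_reset_snapshot;
} SketchParallelScan;

/* Models the branch structure of _bt_begin_parallel in this patch. */
static SketchScanSnapshot
sketch_choose_snapshot(bool isconcurrent, bool isunique)
{
	bool		reset_snapshot = isconcurrent && !isunique;

	if (!isconcurrent)
		return SKETCH_SNAPSHOT_ANY;
	if (reset_snapshot)
		return SKETCH_SNAPSHOT_RESETTING;
	return SKETCH_SNAPSHOT_REGISTERED;
}

/* Models the flag assignment in table_parallelscan_initialize. */
static SketchParallelScan
sketch_init_scan(SketchScanSnapshot mode)
{
	SketchParallelScan pscan = {false, false};

	switch (mode)
	{
		case SKETCH_SNAPSHOT_RESETTING:
			pscan.phs_reset_snapshot = true;
			break;
		case SKETCH_SNAPSHOT_REGISTERED:
			/* the snapshot would be serialized here; both flags stay false */
			break;
		case SKETCH_SNAPSHOT_ANY:
			pscan.phs_snapshot_any = true;
			break;
	}

	/* The two flags are mutually exclusive in all three branches. */
	assert(!(pscan.phs_snapshot_any && pscan.phs_reset_snapshot));
	return pscan;
}
```

BRIN has no unique variant, so in the brin.c hunks every concurrent build takes the resetting path. Only the btree code needs the extra uniqueness check.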

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c21608a6fd8..e580483a7cb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1244,7 +1243,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1259,6 +1257,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2359,7 +2358,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2390,25 +2388,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as the
+	 * active snapshot.  The scan then indexes whatever is live according to
+	 * the current snapshot, which is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2448,8 +2446,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2474,7 +2470,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2520,7 +2517,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2536,6 +2532,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we must
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2544,7 +2547,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2567,9 +2571,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2769,14 +2770,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2798,6 +2799,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2938,6 +2940,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 580ec7f9aa8..3b3cbe571ac 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to it; for a non-unique index that
+	 * snapshot is reset periodically during the scan.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b490da0eeee..810f80fc8e6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that; for a non-unique index, that snapshot may be
+	 * reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically.  Wait until
+	 * all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e18a8f8250f..b5b7be60a5e 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that
+	 * use periodic snapshot resets, mark the scan accordingly and use the active
+	 * snapshot as the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 7817bedc2ef..e9c0a46fd78 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report whether
+		 * each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1495,6 +1533,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8c6dfecf515..707ff39ef40 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index fa2d522b25f..ef4d0ae2fab 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3d018c3a1e8..4cd536e988c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 8811618acb7..f5cae39c85f 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index dc6e0184284..8529b808aed 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ec8928ad90b..9a9b094f3f1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1180,7 +1180,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1798,9 +1799,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. That causes snapshots to be changed on the fly, allowing the
+ * xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
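
To make the leader/worker handshake in the patch above concrete: the leader
counts a worker as attached only once the worker's error queue has a sender
and, when wait_for_snapshot is set, the worker has also set its
snapshot_restored flag. The following is a minimal sketch of that condition
only, not code from the patch. The helper count_known_attached and the plain
bool arrays are hypothetical stand-ins for the shm_mq sender check and the
flags kept in dynamic shared memory.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not part of the patch): count the workers the leader
 * may treat as attached.
 *
 * attached[i]          stands in for "worker i has attached to its error queue"
 * snapshot_restored[i] stands in for the per-worker flag in shared memory
 */
static int
count_known_attached(const bool *attached, const bool *snapshot_restored,
					 int nworkers, bool wait_for_snapshot)
{
	int			known = 0;

	assert(nworkers >= 0);

	for (int i = 0; i < nworkers; i++)
	{
		if (!attached[i])
			continue;

		/* Same condition as the one added to WaitForParallelWorkersToAttach. */
		if (!wait_for_snapshot || snapshot_restored[i])
			known++;
	}

	return known;
}
```

With wait_for_snapshot set to false the behaviour is unchanged from before the
patch: every attached worker is counted. With it set to true, a worker that has
attached but not yet restored its snapshot is simply not counted yet, so the
leader keeps waiting.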



  [application/octet-stream] v13-0004-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.0K, 6-v13-0004-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From 31f62e7cd67cb02449ed02e5cf7d5d489ad7f20f Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v13 04/11] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
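
The second check listed above can be illustrated with a small sketch. It
assumes entries are already sorted by key and that liveness is known up front;
the patch itself determines liveness lazily by fetching the heap tuple with
SnapshotSelf. SortedEntry and find_alive_duplicate are hypothetical names used
only for illustration and do not appear in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple coming out of the sorted spool. */
typedef struct SortedEntry
{
	int			key;			/* index key value */
	bool		alive;			/* is the heap tuple alive right now? */
} SortedEntry;

/*
 * Return the position of the second alive entry within a group of equal
 * keys, or -1 if no group contains more than one alive entry.  Dead entries
 * never count as duplicates, wherever they sit in the group.
 */
static int
find_alive_duplicate(const SortedEntry *entries, size_t n)
{
	bool		seen_alive = false;

	for (size_t i = 0; i < n; i++)
	{
		/* A new key starts a new group, so forget what we saw before. */
		if (i > 0 && entries[i].key != entries[i - 1].key)
			seen_alive = false;

		if (entries[i].alive)
		{
			if (seen_alive)
				return (int) i; /* two alive tuples share this key */
			seen_alive = true;
		}
	}

	return -1;
}
```

For a group laid out as alive, dead, dead, dead, alive (the "addda" case
mentioned in the patch comments), the function reports the last entry, which is
the case the tuplesort comparison alone may miss.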
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 263 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3b3cbe571ac..bc3d3738ede 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 53363ee695a..f8976de6784 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 810f80fc8e6..8b236c8ccd6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * This is required because of the periodic snapshot resets.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee the absence of multiple alive
+	 * tuples with the same values in the spool. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if this is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet tested */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 00e17a1f0f9..647f8e7b3af 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4684,7 +4682,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4802,17 +4800,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as a comparison function during a concurrent
+ * unique index build when _bt_keep_natts_fast is not suitable because the
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is given, it is set to true if any compared column is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4838,6 +4843,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4857,7 +4864,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4868,7 +4875,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent unique index build.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4877,6 +4885,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is given, it is set to true if any compared column is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4885,7 +4895,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4902,6 +4913,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 707ff39ef40..d937ba65c9c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3293,9 +3293,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index c8e7880f954..5921dcf68a1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..0b25926bc56 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -30,6 +30,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -123,6 +124,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +351,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +394,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1524,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1534,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b88bd443554..e756ad9b5b0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9a9b094f3f1..d69baaa364f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1799,9 +1799,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * This changes snapshots on the fly, allowing the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..76131b6f2e1 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
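
The alive-duplicate check that the `_bt_load` hunk above adds can be sketched in isolation. This is only a minimal model, not the patch's code: `Entry` and `has_alive_duplicate` are hypothetical names, liveness is a stored flag (the patch instead fetches each heap tuple with `SnapshotSelf`), and NULL handling is left out. It walks a key-sorted sequence and reports a violation only when one group of equal keys holds two or more alive entries, which covers the "addda" case where the alive tuples are not adjacent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a spooled index tuple: a key plus a liveness
 * flag.  In the patch, liveness comes from a heap fetch, not a stored flag.
 */
typedef struct
{
	int		key;
	bool	alive;
} Entry;

/*
 * Scan entries sorted by key; return true if any run of equal keys
 * contains two or more alive entries.
 */
static bool
has_alive_duplicate(const Entry *entries, size_t n)
{
	bool	seen_alive = false;

	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		/* A new key starts a new group, so forget earlier liveness. */
		if (i == 0 || entries[i].key != entries[i - 1].key)
			seen_alive = false;

		if (entries[i].alive)
		{
			/* Second alive entry in the same group: uniqueness violated. */
			if (seen_alive)
				return true;
			seen_alive = true;
		}
	}
	return false;
}
```

Calling it on five entries with the same key and the liveness pattern alive, dead, dead, dead, alive returns true. A pairwise comparison of neighbours alone would miss that case, because every adjacent pair contains a dead entry.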



  [application/octet-stream] v13-0002-Allow-advancing-xmin-during-non-unique-non-paral.patch (43.6K, 7-v13-0002-Allow-advancing-xmin-during-non-unique-non-paral.patch)
  download | inline diff:
From 07354569b88ef5bde90c7f56fdacdc6821891f69 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v13 02/11] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
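
The effect of resetting the snapshot every N pages can be illustrated with a toy simulation. Everything here is an assumption made for illustration: `ToySnapshot`, `ScanResult` and `scan_with_snapshot_reset` are hypothetical names, the transaction counter is a plain integer, and the real code goes through the snapshot manager rather than assigning an xmin directly. The sketch only shows why the xmin held by the scan moves forward when resets happen and stays pinned when they do not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot: only the xmin it holds back matters here. */
typedef struct
{
	uint32_t	xmin;
} ToySnapshot;

typedef struct
{
	uint32_t	final_xmin;	/* xmin held when the scan finishes */
	uint32_t	resets;		/* number of snapshot resets performed */
} ScanResult;

/*
 * Simulate a heap scan of total_pages pages.  next_xid advances by
 * xids_per_page for every page scanned, standing in for concurrent
 * write activity.
 */
static ScanResult
scan_with_snapshot_reset(uint32_t total_pages, uint32_t reset_every,
						 uint32_t start_xid, uint32_t xids_per_page)
{
	uint32_t	next_xid = start_xid;
	ToySnapshot	snap = {next_xid};
	ScanResult	result = {0, 0};

	assert(reset_every > 0);

	for (uint32_t page = 0; page < total_pages; page++)
	{
		/* Every reset_every pages, swap in a fresh snapshot. */
		if (page > 0 && page % reset_every == 0)
		{
			snap.xmin = next_xid;
			result.resets++;
		}

		/* ... tuples on this page would be checked against snap here ... */
		next_xid += xids_per_page;
	}

	result.final_xmin = snap.xmin;
	return result;
}
```

With 10 pages, a reset every 4 pages, a starting xid of 100 and 5 new xids per page, the scan resets twice and ends holding xmin 140. With a reset interval larger than the table, it never resets and keeps holding xmin 100 for the whole scan.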

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  16 +++
 src/backend/access/gin/gininsert.c            |   3 +
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 407 insertions(+), 34 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 7f7b55d902a..a026fbc692a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9a984547578..c21608a6fd8 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1224,6 +1224,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1243,6 +1244,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2366,6 +2368,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2394,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2446,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2527,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2545,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8e1788dbcf7..97ef10c0098 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -21,6 +21,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -375,6 +376,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	/*
 	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
 	 * prefers to receive tuples in TID order.
@@ -423,6 +425,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	return result;
 }
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f950b9925f5..901aa667aa0 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -191,6 +191,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 485525f4d64..86286dc89c3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -568,6 +569,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* The goal of the snapshot reset is to allow the horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -609,7 +640,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1236,6 +1272,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e817f8f8f84..580ec7f9aa8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * For a unique index we need a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed because the snapshot is going to be replaced every
+			 * so often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a reference to the active snapshot, since the push may have copied it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 07bae342e25..0d262a4188d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db90..b490da0eeee 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 221fbb4e286..8c6dfecf515 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3206,7 +3223,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3269,12 +3287,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0ff498c4e14..c8e7880f954 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible to a single snapshot or to
+	 * several periodically refreshed ones. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e92e108b6b6..a26e0832e38 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6778,6 +6779,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6833,6 +6835,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6890,6 +6897,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7d06dad83fc..43bdf62b944 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 09b9b394e0e..ec8928ad90b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to let the xmin horizon keep advancing.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* The active snapshot must not be registered, so that xmin can advance. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1775,6 +1797,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. The snapshot is then changed
+ * on the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..2225cd0bf87 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 989b4db226b..fb131270668 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v13-0006-tuplestore-add-support-for-storing-Datum-values.patch (17.3K, 8-v13-0006-tuplestore-add-support-for-storing-Datum-values.patch)
  download | inline diff:
From 812b58f910119ccec6c0023fa0f7a89d49c07867 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v13 06/11] tuplestore: add support for storing Datum values

Add ability to store and retrieve individual Datum values in tuplestore, optimizing storage based on type:

- Fixed-length: stores raw bytes without length prefix
- Variable-length: includes length prefix/suffix
- By-value types handled inline

This extends tuplestore beyond tuples; the new API is intended for use by the next patch.
---
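Reviewer note: the on-tape layout described in the commit message can be sketched in isolation as below. This is a minimal illustration, not the patch's code: the `Tape` struct and the `tape_*`, `put_*` and `get_*` helpers are hypothetical stand-ins for BufFile and the writetup/readtup callbacks, and it assumes the length word counts only the payload bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for a BufFile in this sketch. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

static void
tape_read(Tape *t, void *dst, size_t n)
{
	assert(t->rpos + n <= t->wpos);
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
}

/* Fixed-length value: raw bytes only, no length words. */
static void
put_fixed(Tape *t, const void *val, size_t typlen)
{
	tape_write(t, val, typlen);
}

static void
get_fixed(Tape *t, void *val, size_t typlen)
{
	tape_read(t, val, typlen);
}

/*
 * Variable-length value: leading length word, payload, and a trailing
 * length word only when backward scans are requested.
 */
static void
put_varlen(Tape *t, const void *val, unsigned int len, int backward)
{
	tape_write(t, &len, sizeof(len));
	tape_write(t, val, len);
	if (backward)
		tape_write(t, &len, sizeof(len));
}

/* Returns the payload length; 'val' must have room for 'cap' bytes. */
static unsigned int
get_varlen(Tape *t, void *val, size_t cap, int backward)
{
	unsigned int len;

	tape_read(t, &len, sizeof(len));
	assert(len <= cap);
	tape_read(t, val, len);
	if (backward)
		tape_read(t, &len, sizeof(len));	/* skip the trailing word */
	return len;
}
```

The point of the sketch is that the writer and the reader must agree on the width of the length word (here `unsigned int` on both sides) and that the length must be known before it is written.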
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index aacec8b7993..4ed13da6046 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* typbyval of that type */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape. Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (unless it is omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * 1024L;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+						 bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index ed7c454f44e..1f431863387 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v13-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (110.0K, 9-v13-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
  download | inline diff:
From d759095ecd9466966fc3d1e20f6dc294e44c7419 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v13 07/11] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
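Reviewer note: one way to picture the merge step named in the commit message is as a set difference over sorted tuple identifiers, where entries recorded by the auxiliary index but absent from the main index still need to be inserted. The sketch below is purely illustrative and rests on that assumption; `Tid`, `tid_cmp` and `tids_missing_from_main` are hypothetical names, and the patch's actual logic in heapam_handler.c may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap tuple identifier (block, offset). */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Orders TIDs by block number, then by offset. */
static int
tid_cmp(Tid a, Tid b)
{
	if (a.block != b.block)
		return a.block < b.block ? -1 : 1;
	if (a.offset != b.offset)
		return a.offset < b.offset ? -1 : 1;
	return 0;
}

/*
 * Both inputs must be sorted ascending and free of duplicates.  Copies
 * every TID that is present in aux[] but absent from main_tids[] into
 * missing[] (which must have room for naux entries) and returns the count.
 */
static size_t
tids_missing_from_main(const Tid *main_tids, size_t nmain,
					   const Tid *aux, size_t naux,
					   Tid *missing)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		n = 0;

	while (j < naux)
	{
		/* Once main_tids[] is exhausted, every remaining aux TID is missing. */
		int			c = (i < nmain) ? tid_cmp(main_tids[i], aux[j]) : 1;

		if (c < 0)
			i++;				/* main entry not tracked by aux: skip it */
		else if (c == 0)
		{
			i++;				/* already indexed: nothing to do */
			j++;
		}
		else
			missing[n++] = aux[j++];	/* only in aux: needs inserting */
	}
	return n;
}
```

Because both inputs are consumed in a single pass, the cost is linear in the number of index entries rather than in the size of the table, which is the kind of saving the commit message attributes to dropping the second heap scan.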
 doc/src/sgml/monitoring.sgml                  |  26 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/README.HOT            |  15 +-
 src/backend/access/heap/heapam_handler.c      | 593 ++++++++++++------
 src/backend/catalog/index.c                   | 312 +++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 ++++++++---
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  31 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 +
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 21 files changed, 1195 insertions(+), 402 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d0d176cc54f..cf7a3bf5271 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6202,6 +6202,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6242,13 +6254,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6265,8 +6276,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> also waits in this phase before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..6a05620bd67 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,11 +368,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any missing ones that are visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bc3d3738ede..96c04e9add7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,452 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded,
+					fetched;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+	ItemPointerSetInvalid(&fetched);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If no block has been seen yet (prev_block_number == InvalidBlockNumber,
+	 *      true only for the very first tuple), initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't recorded a block number yet (very first tuple of the
+		 * scan), initialize it using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Now take the snapshot that will be used to filter candidate
+	 * tuples.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; that is why limitXmin is returned.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
+
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2dbf8f82141..3a89d18505c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most
+ *		cases it should equal the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +748,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +760,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +797,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1407,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1462,7 +1473,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1484,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2467,7 +2628,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2527,7 +2689,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3276,12 +3439,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation without any
+ * actual table scan, and it leaves us with an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3291,18 +3463,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3310,12 +3485,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3333,22 +3510,27 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Finally, the auxiliary index is dropped using the concurrent-drop steps.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3381,12 +3563,16 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
 
 	/* mark build is concurrent just for consistency */
@@ -3405,15 +3591,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only the relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3436,27 +3637,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3465,8 +3672,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3525,6 +3736,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3796,6 +4012,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4038,6 +4261,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4063,6 +4287,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7a595c84db9..0e4d977db87 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1265,16 +1265,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5921dcf68a1..cd0d63ded82 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -554,6 +557,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -563,6 +567,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -584,10 +589,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -834,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -929,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1227,7 +1242,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1569,6 +1585,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of concurrent build - create auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1597,11 +1623,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new
+	 * indexes (main and auxiliary) will be marked not indisready and not
+	 * indisvalid, so that no one else tries to insert into or query them.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1611,7 +1637,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1650,7 +1676,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1688,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index.  It will be created empty, without
+	 * any actual heap scan, but marked as "ready-for-inserts".  Its purpose is
+	 * to accumulate new tuples while the main index is actually being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1698,43 +1748,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1757,12 +1795,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1787,6 +1825,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "non-ready for inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3542,6 +3627,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3647,8 +3733,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3700,8 +3793,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3762,6 +3862,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3865,15 +3972,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3924,6 +4034,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4052,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3951,6 +4071,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3969,10 +4090,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4178,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because it creates an empty index without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4102,24 +4269,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4134,13 +4329,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4152,16 +4340,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4361,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4271,14 +4451,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4303,6 +4483,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4316,11 +4518,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4542,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 694a2518ba5..4af3d3f7455 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -784,7 +784,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -800,6 +800,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -825,7 +826,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index d69baaa364f..e2c0fc8fd66 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1862,22 +1862,25 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both state and auxstate.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..01f85e57ea2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 18e3179ef63..4c3ea686494 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -92,14 +92,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7bfe0acb91c..8ab74e2b1d9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -177,8 +177,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 8011c141bf8..34331e4d48b 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3028,6 +3029,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3040,8 +3042,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3069,6 +3073,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fef..e0a46c0a42a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2013,14 +2013,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 068c66b95a5..b410fa5c541 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1244,10 +1245,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1259,6 +1262,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v13-0009-Remove-PROC_IN_SAFE_IC-optimization.patch (20.6K, 10-v13-0009-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 519bfd03f076a339a77462bfcab13d2cac5f8f33 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v13 09/11] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
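Reviewer note (sketch only, not part of the patch): the effect of dropping
PROC_IN_SAFE_IC from the exclude mask in WaitForOlderSnapshots() can be
illustrated with the flag values shown in the proc.h hunk below. The helper
backend_is_skipped() and the two *_EXCLUDE_MASK macros are hypothetical names
introduced for this illustration; the real filtering happens inside
GetCurrentVirtualXIDs() and is more involved than a single mask test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in src/include/storage/proc.h in this patch. */
#define PROC_IS_AUTOVACUUM  0x01
#define PROC_IN_VACUUM      0x02
#define PROC_IN_SAFE_IC     0x04    /* removed by this patch */

/* Hypothetical names for the masks passed before and after the change. */
#define OLD_EXCLUDE_MASK (PROC_IS_AUTOVACUUM | PROC_IN_VACUUM | PROC_IN_SAFE_IC)
#define NEW_EXCLUDE_MASK (PROC_IS_AUTOVACUUM | PROC_IN_VACUUM)

/*
 * Hypothetical helper: returns true when a backend with the given status
 * flags would be left out of the list of snapshots to wait for.
 */
static bool
backend_is_skipped(uint8_t status_flags, uint8_t exclude_mask)
{
    return (status_flags & exclude_mask) != 0;
}
```

Under these assumptions, a backend that only had PROC_IN_SAFE_IC set was
skipped with the old mask and is waited for with the new one, while vacuum
and autovacuum backends are still skipped.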
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 8 files changed, 11 insertions(+), 233 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e580483a7cb..b4b36bda018 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2885,11 +2885,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker at this point.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 62e975016ad..1eb4299826e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker at this point.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e10f6098f58..b98851a9e35 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -116,7 +116,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -419,10 +418,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -443,8 +439,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -464,8 +459,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -579,7 +573,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1157,10 +1150,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1647,10 +1636,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1705,9 +1690,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1737,10 +1719,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1766,9 +1744,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1785,9 +1761,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1828,10 +1801,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1852,10 +1821,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3630,7 +3595,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4002,17 +3966,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4072,7 +4025,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4165,11 +4117,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4200,10 +4147,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4212,11 +4155,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4241,10 +4179,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4264,11 +4198,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4289,10 +4218,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4325,10 +4250,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4356,9 +4277,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4380,13 +4298,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4442,12 +4353,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4509,12 +4414,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4774,36 +4673,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 20777f7d5ae..4bd24bc02d4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 2225cd0bf87..b257a0344a8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc cic_reset_snapshots
+REGRESS = injection_points cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index fb131270668..051b3e789c1 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -34,7 +34,6 @@ tests += {
   'regress': {
     'sql': [
       'injection_points',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v13-0008-Concurrently-built-index-validation-uses-fresh-s.patch (14.1K, 11-v13-0008-Concurrently-built-index-validation-uses-fresh-s.patch)
  download | inline diff:
From 068c0a43320c1f4a0f9c63b9e6db57c280067620 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 17:21:29 +0100
Subject: [PATCH v13 08/11] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach of using a single reference snapshot could lead to issues with xmin propagation. Specifically, if the index build took a long time, the reference snapshot's xmin could become outdated, causing the index to miss tuples that were deleted by transactions that committed after the reference snapshot was taken.

To address this, the validation process now periodically replaces the snapshot with a newer one. This ensures that the index's xmin is kept up-to-date and that all relevant tuples are included in the index.
---
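Reviewer note (sketch only, not part of the patch): the "periodically replace
the snapshot" idea can be pictured as a page counter that triggers a refresh
every N pages. The counter starts at 1, as in the heapam_index_validate_scan()
hunk below, so that no refresh happens before the first page is read. The
function name count_snapshot_refreshes() and the reset_every parameter are
assumptions made for this illustration; the actual interval and the exact
placement of the check are defined by the patch, not by this sketch.

```c
/*
 * Hypothetical model of the validation scan loop: returns how many times
 * the snapshot would be refreshed while reading pages_to_read heap pages,
 * refreshing whenever the running counter reaches a multiple of
 * reset_every.
 */
static int
count_snapshot_refreshes(int pages_to_read, int reset_every)
{
    int page_read_counter = 1;  /* start at 1: skip a refresh at scan start */
    int refreshes = 0;

    for (int page = 0; page < pages_to_read; page++)
    {
        if (page_read_counter % reset_every == 0)
            refreshes++;        /* stand-in for: drop old snapshot, take a new one */

        page_read_counter++;    /* stand-in for: read and process one heap page */
    }

    return refreshes;
}
```

With reset_every = 4, a 10-page scan refreshes twice in this model, and a
3-page scan never refreshes, so the advertised xmin is held back for at most
a bounded number of pages rather than for the whole scan.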
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/heapam_handler.c | 77 +++++++++++++++---------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 14 +++--
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 +++++
 8 files changed, 97 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6a05620bd67..64c633e0398 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -495,10 +495,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+   </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 96c04e9add7..f83934cf6d7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1791,8 +1791,8 @@ heapam_index_build_range_scan(Relation heapRelation,
  */
 static int
 heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
-										   Tuplesortstate  *aux,
-										   Tuplestorestate *store)
+									  Tuplesortstate  *aux,
+									  Tuplestorestate *store)
 {
 	int				num = 0;
 	/* state variables for the merge */
@@ -2050,7 +2050,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,9 +2062,35 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
-	 * Now take the snapshot that will be used by to filter candidate
-	 * tuples.
+	 * sanity checks
+	 */
+	Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+															  auxState->tuplesort,
+															  tuples_for_check);
+
+	/* It is our responsibility to close tuple sort as fast as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the xmin horizon.
 	 *
 	 * Beware!  There might still be snapshots in use that treat some transaction
 	 * as in-progress that our temporary snapshot treats as committed.
@@ -2079,33 +2106,10 @@ heapam_index_validate_scan(Relation heapRelation,
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
 	limitXmin = snapshot->xmin;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
-	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
-
-	/*
-	 * sanity checks
-	 */
-	Assert(OidIsValid(indexRelation->rd_rel->relam));
-
-	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
-														 auxState->tuplesort,
-														 tuples_for_check);
-
-	/* It is our responsibility to sloe tuple sort as fast as we can */
-	tuplesort_end(state->tuplesort);
-	tuplesort_end(auxState->tuplesort);
-
-	state->tuplesort = auxState->tuplesort = NULL;
-
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2142,6 +2146,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2196,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+
+		if (page_read_counter % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8b236c8ccd6..62e975016ad 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..6a6b1f8797b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -190,14 +190,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -811,7 +813,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -925,6 +926,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -965,6 +970,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3a89d18505c..1943dd46243 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3477,8 +3477,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3491,7 +3492,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3574,6 +3575,7 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3609,6 +3611,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3638,9 +3643,6 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	}
 	tuplesort_performsort(state.tuplesort);
 	tuplesort_performsort(auxState.tuplesort);
-
-	PopActiveSnapshot();
-	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cd0d63ded82..e10f6098f58 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4354,7 +4354,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 0cab8653f1b..3d8db998c0b 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0



  [application/octet-stream] v13-0010-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 12-v13-0010-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From 83db25df46f9e763487d7cc167ceb065b7f293dc Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v13 10/11] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, because of an index
constraint violation), such indexes become junk indexes that serve no purpose.
This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)
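
To make the lookup behind the automatic cleanup easier to follow, here is a toy, self-contained C model of the search that get_auxiliary_index() performs over pg_depend. It is a sketch under stated assumptions rather than the patch's implementation: oid_t, dep_record and find_auxiliary_index are hypothetical names, the dependency table is a plain in-memory array, and the dependent object's relkind is stored directly in each record instead of being looked up in the catalog.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int oid_t;        /* stand-in for Oid */
#define INVALID_OID ((oid_t) 0)    /* stand-in for InvalidOid */

/* Simplified dependency record (hypothetical; not the pg_depend layout). */
typedef struct dep_record
{
	oid_t		objid;		/* dependent object */
	oid_t		refobjid;	/* referenced object */
	char		deptype;	/* 'a' = auto, 'n' = normal */
	char		relkind;	/* relkind of objid; 'i' = index */
} dep_record;

/*
 * Return the first object that has an AUTO dependency on index_id and
 * is itself an index, or INVALID_OID if there is none.
 */
static oid_t
find_auxiliary_index(const dep_record *deps, size_t ndeps, oid_t index_id)
{
	assert(deps != NULL || ndeps == 0);

	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == index_id &&
			deps[i].deptype == 'a' &&
			deps[i].relkind == 'i')
			return deps[i].objid;
	}
	return INVALID_OID;
}
```

Because the auxiliary index is recorded with an AUTO dependency on its main index, dropping the main index removes the auxiliary one as well, and the same lookup lets REINDEX find a leftover auxiliary index to delete.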

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 64c633e0398..c6db5d57167 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 096b68c7f39..1c2cfc94b54 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1943dd46243..b7d42c6965f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -687,6 +687,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -733,6 +735,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -775,6 +778,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be given together or not at all */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1176,6 +1181,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1458,6 +1472,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1608,6 +1623,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3824,6 +3840,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3880,6 +3897,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for a junk auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4168,7 +4198,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4257,13 +4288,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4289,18 +4337,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index from an object
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b98851a9e35..ab6dbd32d9f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1224,7 +1224,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3593,6 +3593,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3941,6 +3942,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3948,6 +3950,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4010,12 +4013,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4025,6 +4033,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4045,10 +4054,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4205,7 +4222,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start process of dropping auxiliary indexes - including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4224,6 +4242,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4406,6 +4427,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4451,6 +4474,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4181c110eb7..e9b6ded6a55 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1492,6 +1492,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1552,9 +1554,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1606,6 +1619,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop needs to be the first transaction. But if
+		 * there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, silently perform an internal deletion
+		 * of the auxiliary index and restore the transaction state. It is fine
+		 * to do it in the loop because there is only a single element in drop->objects.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1634,12 +1675,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 01f85e57ea2..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 34331e4d48b..d858545dba3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3096,20 +3096,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index b410fa5c541..95e6f72fd4c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1273,11 +1273,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v13-0011-Updates-index-insert-and-value-computation-logic.patch (2.2K, 13-v13-0011-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From 24ccb2f86a38ddcda6f1e2e5b961e468d78100a0 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v13 11/11] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since they are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b7d42c6965f..26ef4dfea27 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2929,6 +2929,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ae11c1dd463..d070f80795d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -434,11 +434,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * In case of auxiliary index always pass false as optimisation.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [image/png] io2.png (63.4K, 14-io2.png)
  download | view image

  [image/png] local.png (64.4K, 15-local.png)
  download | view image


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-02-04 01:38  Michail Nikolaev <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2025-02-04 01:38 UTC (permalink / raw)
  To: [email protected]; +Cc: Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Alvaro!

I want to bring this patch to your attention because I think (and hope :P)
you might be interested in it, since it all began with your work in 2021 [0].
That feature (the ability to create/reindex indexes concurrently without
impacting the vacuum horizon) made my life better :) Unfortunately, due to
[1], only for a short period.
As you said in [3]:
> Deciding to revert makes me sad, because this feature is extremely
> valuable for users
and I strongly agree with you here.

It started with some ideas for a smaller patch scope, but ended up as:
• [CREATE|RE]INDEX CONCURRENTLY affects vacuum for only a few transactions
(snapshots are reset regularly)
• CI/RC is achieved in (almost) a single heap scan (yes, about 2x-3x faster
in many cases)
• all core MVCC-related code is unaffected; everything is protected using
regular snapshots (as I remember, Anders was against any changes to that
part)
• the feature was actively tested for correctness: while hunting for bugs in
the patch I found five other issues (including [4], a bug in amcheck
itself, which I was using to check indexes for correctness under stress;
it was a tough story).
• benchmarks show great results (see attachments); see [2], [5] and [6]
for more results, details and explanations

In a few words, it works like this:
• before building the index, an empty auxiliary index using the new STIR
(short-term index replacement) access method is created (for the same
columns, predicates, etc.). STIR is unlogged and stores only the TIDs of
newly arriving tuples (datums are not even prepared for it during insert
if possible)
• during the first heap scan, the snapshot used for the scan is reset
every few pages, allowing xmin to advance (for a unique index we also
need some additional logic to stay correct)
• instead of the second heap scan, we just compare the TIDs of the target
and auxiliary indexes and insert everything present in STIR but absent
from the target index (also resetting snapshots every few pages while
doing so)
• the auxiliary STIR index is then dropped (it is also dropped in other
cases, to avoid burdening DB administrators)
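
As an illustration only: the validation step above (insert everything
present in STIR but absent in the target index) boils down to a set
difference over sorted TIDs. A minimal sketch in plain C, assuming both TID
sets are already sorted and free of duplicates; the Tid struct and
tids_missing_from_target() are hypothetical names for this example, not code
from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap tuple identifier (block, offset). */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Order TIDs by block number, then by offset within the block. */
int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Write to "out" every TID that appears in "aux" but not in "target".
 * Both inputs must be sorted ascending; "out" needs room for naux entries.
 * Returns the number of TIDs written.
 */
size_t
tids_missing_from_target(const Tid *aux, size_t naux,
						 const Tid *target, size_t ntarget,
						 Tid *out)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		nout = 0;

	assert(aux != NULL || naux == 0);

	while (i < naux)
	{
		/* Skip target entries that sort before the current aux TID. */
		while (j < ntarget && tid_cmp(&target[j], &aux[i]) < 0)
			j++;

		/* Not found in the target index: it still has to be inserted. */
		if (j == ntarget || tid_cmp(&target[j], &aux[i]) != 0)
			out[nout++] = aux[i];
		i++;
	}
	return nout;
}
```

Each input is walked once, so the cost is linear in the number of TIDs and
no second pass over the heap is needed to find the missing entries.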

I have split the patch into 12 commits; some parts may be committed
separately. Some explanation of how the patches are separated can be found
at [7]. I have tried to structure them as much as possible (each improves
some small part of the whole set). Commit messages explain the changes (I
hope).

I can provide any additional details you may need; feel free to ask. Also,
I have some infrastructure for benchmarks and validation tests, so if you
want to check/test anything, I am happy to help.

I know it may feel like a naïve “miracle” patch from a dummy (a 2x index
build speedup without affecting the horizon, aha), but give it a chance.

Also, the latest version of the patch is attached.

Best regards,
Mikhail.

[0]:
https://www.postgresql.org/message-id/[email protected]
[1]:
https://www.postgresql.org/message-id/17485-396609c6925b982d%40postgresql.org
[2]:
https://discord.com/channels/1258108670710124574/1259884843165155471/1334565506149253150
[3]:
https://www.postgresql.org/message-id/flat/202205251643.2py5jjpaw7wy%40alvherre.pgsql#589508d30b480b...
[4]:
https://www.postgresql.org/message-id/flat/CAH2-WzmcFDK2OzziTgdHxPTmaRQmSFLoDjS-C06uWGTsXibx9g%40mai...
[5]:
https://www.postgresql.org/message-id/flat/CANtu0ojHAputNCH73TEYN_RUtjLGYsEyW1aSXmsXyvwf%3D3U4qQ%40m...
[6]:
https://www.postgresql.org/message-id/flat/CANtu0oi7d0_8oHpDPi_vFsuD0h71LNL4U2XXg0kq7iY_Ys3%2BSA%40m...
[7]:
https://www.postgresql.org/message-id/flat/CANtu0og-4pvn4%2BTCWH6U9ghyd7x7NBAZSgi4ZWyBZdBWH6OpWA%40m...


Attachments:

  [application/octet-stream] v14-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.0K, 3-v14-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From 31f62e7cd67cb02449ed02e5cf7d5d489ad7f20f Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v14 05/12] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
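
A minimal sketch of check 2, assuming entries arrive sorted by key and that
liveness is already known per entry (the patch itself fetches it from the
heap with SnapshotSelf). SpoolEntry and has_alive_duplicate() are
hypothetical names for this example, and NULL handling is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sorted spool entry: an index key plus heap liveness. */
typedef struct SpoolEntry
{
	int			key;
	bool		alive;
} SpoolEntry;

/*
 * Return true if any run of equal keys holds more than one alive entry.
 * "entries" must be sorted by key; dead entries may sit between alive ones.
 */
bool
has_alive_duplicate(const SpoolEntry *entries, size_t n)
{
	bool		seen_alive = false;

	for (size_t i = 0; i < n; i++)
	{
		/* A new key starts a new group, so forget the previous group. */
		if (i > 0 && entries[i].key != entries[i - 1].key)
			seen_alive = false;

		if (entries[i].alive)
		{
			/* Second alive entry in the same group: uniqueness violated. */
			if (seen_alive)
				return true;
			seen_alive = true;
		}
	}
	return false;
}
```

A group such as alive, dead, dead, dead, alive is reported as a violation
even though the two alive entries are not adjacent, which is the case a
comparison of neighbouring tuples alone would miss.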
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 263 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3b3cbe571ac..bc3d3738ede 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 53363ee695a..f8976de6784 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 810f80fc8e6..8b236c8ccd6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * It is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples need not be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot
+	 * during the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool is free of
+	 * multiple alive tuples with the same values. That can happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check previous tuple if not yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 00e17a1f0f9..647f8e7b3af 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4684,7 +4682,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4802,17 +4800,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as comparison function during concurrent
+ * unique index build in case _bt_keep_natts_fast is not suitable because
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when a null is found in
+ * any compared column of either tuple; it is never reset to false here.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4838,6 +4843,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4857,7 +4864,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4868,7 +4875,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. Also, it may be used as comparison function during concurrent
+ * build of unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4877,6 +4885,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when a null is found in
+ * any compared column of either tuple; it is never reset to false here.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4885,7 +4895,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4902,6 +4913,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 707ff39ef40..d937ba65c9c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3293,9 +3293,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index c8e7880f954..5921dcf68a1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..0b25926bc56 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -30,6 +30,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -123,6 +124,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +351,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +394,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1524,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1534,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b88bd443554..e756ad9b5b0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9a9b094f3f1..d69baaa364f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1799,9 +1799,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * That causes snapshots to change on the fly so the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..76131b6f2e1 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
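
The unique check that the patch above adds to _bt_load walks the sorted
tuples and raises a unique violation only when one group of equal keys
contains more than one alive tuple. The rule can be sketched in isolation
as below. This is a minimal standalone illustration, not the patch's code:
Entry and find_live_duplicate are hypothetical names, the `alive` flag
stands in for the result of table_index_fetch_tuple() with SnapshotSelf,
and NULL handling and the lazy fetch of the previous tuple are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one index tuple coming out of the sort. */
typedef struct Entry
{
	int			key;	/* index key; input must be sorted by key */
	bool		alive;	/* stand-in for a SnapshotSelf heap fetch result */
} Entry;

/*
 * Return the position of the first entry that is the second alive member
 * of its group of equal keys, or -1 if every group has at most one alive
 * entry.  Dead entries never cause a violation on their own.
 */
long
find_live_duplicate(const Entry *entries, size_t n)
{
	bool		seen_alive = false;

	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		bool		same_group = (i > 0 &&
								  entries[i].key == entries[i - 1].key);

		/* A new key starts a new group, so forget what we saw before. */
		if (!same_group)
			seen_alive = false;

		if (entries[i].alive)
		{
			if (seen_alive)
				return (long) i;	/* second alive tuple in this group */
			seen_alive = true;
		}
	}
	return -1;
}
```

In this sketch a group such as (1, alive), (1, dead) passes, while two
alive entries with key 1 are reported at the position of the second one.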



  [application/octet-stream] v14-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (17.5K, 4-v14-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From e4e33536ec7137caedd31eea050589c8398cb800 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v14 01/12] This is https://commitfest.postgresql.org/50/5160/
 merged into a single commit. It is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 6 files changed, 216 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d6e23caef17..0ff498c4e14 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 7c87f012c30..ae11c1dd463 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7e71d422a62..3922ae39681 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Indexes backing exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for any additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1af8c9caf6c..8a1a085b106 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index b9759c31252..f91203dd353 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE", skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the
+		 * same set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8f1508b1ee2..3d018c3a1e8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0



  [application/octet-stream] v14-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch (43.6K, 5-v14-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch)
  download | inline diff:
From 07354569b88ef5bde90c7f56fdacdc6821891f69 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v14 03/12] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  16 +++
 src/backend/access/gin/gininsert.c            |   3 +
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 407 insertions(+), 34 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 7f7b55d902a..a026fbc692a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9a984547578..c21608a6fd8 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1224,6 +1224,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1243,6 +1244,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2366,6 +2368,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2394,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2446,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2527,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2545,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8e1788dbcf7..97ef10c0098 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -21,6 +21,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -375,6 +376,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	/*
 	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
 	 * prefers to receive tuples in TID order.
@@ -423,6 +425,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	return result;
 }
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f950b9925f5..901aa667aa0 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -191,6 +191,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 485525f4d64..86286dc89c3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -568,6 +569,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible because an xid has been assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -609,7 +640,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1236,6 +1272,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e817f8f8f84..580ec7f9aa8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * For a unique index we need a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to scan with
+	 * SO_RESET_SNAPSHOT, and that infrastructure is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed because the snapshot is replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* store link to snapshot because it may be copied */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 07bae342e25..0d262a4188d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db90..b490da0eeee 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 221fbb4e286..8c6dfecf515 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3206,7 +3223,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false); /* no snapshot reset */
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3269,12 +3287,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0ff498c4e14..c8e7880f954 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible under a single
+	 * snapshot or several refreshed ones.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e92e108b6b6..a26e0832e38 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6778,6 +6779,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6833,6 +6835,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6890,6 +6897,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7d06dad83fc..43bdf62b944 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 09b9b394e0e..ec8928ad90b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot must not be registered, so that xmin can keep advancing. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1775,6 +1797,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. This causes snapshots to be changed
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..2225cd0bf87 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 989b4db226b..fb131270668 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
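
The SO_RESET_SNAPSHOT comment in the patch above describes refreshing the scan
snapshot every SO_RESET_SNAPSHOT_EACH_N_PAGE pages so that the backend's xmin
is not pinned for the whole build. What follows is a minimal toy model of that
cadence in plain C, not code from the patch: ToyScan, toy_scan_begin,
toy_scan_next_page and toy_scan_run are hypothetical names, the xmin values are
made-up counters, and no PostgreSQL API is called. The real
heap_reset_scan_snapshot is not shown in this message and may reset at other
points as well, which this model does not try to capture.

```c
#include <assert.h>
#include <stdint.h>

/* Same cadence as SO_RESET_SNAPSHOT_EACH_N_PAGE in the patch. */
#define TOY_RESET_EACH_N_PAGES 4096

/* Hypothetical stand-in for a heap scan descriptor; not a PostgreSQL type. */
typedef struct ToyScan
{
	uint32_t	held_xmin;			/* xmin pinned by the current snapshot */
	uint32_t	pages_since_reset;	/* pages read since the last refresh */
	uint32_t	resets;				/* number of snapshot refreshes so far */
	int			reset_snapshots;	/* models the SO_RESET_SNAPSHOT flag */
} ToyScan;

static void
toy_scan_begin(ToyScan *scan, uint32_t start_xmin, int reset_snapshots)
{
	scan->held_xmin = start_xmin;
	scan->pages_since_reset = 0;
	scan->resets = 0;
	scan->reset_snapshots = reset_snapshots;
}

/*
 * Read one page.  fresh_xmin is the xmin a newly taken snapshot would have
 * at this moment (an assumption of the model, supplied by the caller).
 */
static void
toy_scan_next_page(ToyScan *scan, uint32_t fresh_xmin)
{
	assert(fresh_xmin >= scan->held_xmin);
	scan->pages_since_reset++;
	if (scan->reset_snapshots &&
		scan->pages_since_reset >= TOY_RESET_EACH_N_PAGES)
	{
		/* Models "pop the old snapshot, push the latest one". */
		scan->held_xmin = fresh_xmin;
		scan->pages_since_reset = 0;
		scan->resets++;
	}
}

/*
 * Scan npages pages while the fresh xmin advances by one per page, and
 * return the xmin still held when the scan finishes.
 */
static uint32_t
toy_scan_run(uint32_t npages, int reset_snapshots, uint32_t *resets)
{
	ToyScan		scan;
	uint32_t	start_xmin = 100;
	uint32_t	page;

	toy_scan_begin(&scan, start_xmin, reset_snapshots);
	for (page = 1; page <= npages; page++)
		toy_scan_next_page(&scan, start_xmin + page);
	if (resets)
		*resets = scan.resets;
	return scan.held_xmin;
}
```

In this model a 10000-page scan with resets enabled refreshes at pages 4096 and
8192, so the held xmin ends at 8292; with resets disabled it stays at the
starting value of 100 for the whole scan, which is the behaviour the patch is
trying to avoid for non-unique indexes.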



  [application/octet-stream] v14-0002-Add-stress-tests-for-concurrent-index-operations.patch (8.0K, 6-v14-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 6659bd291b5412de62ecdae76d8cac30f0f8487b Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v14 02/12] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 189 ++++++++++++++++++++++++++++++++
 2 files changed, 190 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..a9559dbe3af
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,189 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN/GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 4)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIN (ia);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIST (p);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING BRIN (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING HASH (updated_at);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0
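
Each pgbench script in the stress test above starts with
pg_try_advisory_lock(42): the one client that gets the lock runs the
CREATE/REINDEX/DROP INDEX CONCURRENTLY sequence, and every other client falls
through to the upsert branch, so index builds always run against concurrent
writes. A rough sketch of that non-blocking role election in C follows. It is
only an analogy under stated assumptions: toy_lock, toy_try_lock, toy_unlock
and toy_pick_role are hypothetical names, a C11 atomic_flag stands in for the
advisory lock, and the sequence check on in_row_rebuild is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for advisory lock 42 used by the pgbench scripts. */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;

typedef enum ToyRole
{
	TOY_ROLE_INDEX_BUILDER,		/* got the lock: runs CIC/RIC/DROP */
	TOY_ROLE_WRITER				/* lock busy: runs the upsert branch */
} ToyRole;

/* Like a try-lock: never blocks, only reports whether the lock was taken. */
static bool
toy_try_lock(void)
{
	return !atomic_flag_test_and_set(&toy_lock);
}

/* Release the lock so that a later caller can become the builder. */
static void
toy_unlock(void)
{
	atomic_flag_clear(&toy_lock);
}

/* Decide what this client does in the current iteration. */
static ToyRole
toy_pick_role(void)
{
	return toy_try_lock() ? TOY_ROLE_INDEX_BUILDER : TOY_ROLE_WRITER;
}
```

Because the lock attempt never waits, a client that loses the race immediately
generates write load instead of queuing behind the index build, and the
builder role becomes available again only after toy_unlock() is called.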



  [application/octet-stream] v14-0004-Allow-snapshot-resets-during-parallel-concurrent.patch (34.1K, 7-v14-0004-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From 4ed1282bea6a0515f9e91421da46d88688075305 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v14 04/12] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
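Note (illustration only, not part of the patch): the handshake described in
the commit message, where the leader waits until every parallel worker has
restored its initial snapshot before the scan proceeds, can be sketched as a
small standalone C program. This is a minimal sketch under stated assumptions:
threads stand in for background workers, a C11 atomic flag array stands in for
the shared-memory area keyed by PARALLEL_KEY_SNAPSHOT_RESTORED, and every name
starting with demo_ or Demo is hypothetical and does not exist in PostgreSQL.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_MAX_WORKERS 8

/* Hypothetical stand-in for the per-worker flags kept in shared memory. */
typedef struct DemoShared
{
	atomic_bool snapshot_restored[DEMO_MAX_WORKERS];
} DemoShared;

typedef struct DemoWorkerArg
{
	DemoShared *shared;
	int			worker_number;
} DemoWorkerArg;

/* Worker side: "restore" the initial snapshot, then publish the flag. */
static void *
demo_worker_main(void *p)
{
	DemoWorkerArg *arg = p;

	/* A real worker would restore and push the leader's snapshot here. */
	atomic_store(&arg->shared->snapshot_restored[arg->worker_number], true);
	return NULL;
}

/* Leader side: has every launched worker reported a restored snapshot? */
static bool
demo_all_restored(DemoShared *shared, int nworkers)
{
	for (int i = 0; i < nworkers; i++)
		if (!atomic_load(&shared->snapshot_restored[i]))
			return false;
	return true;
}

/*
 * Launch nworkers threads, wait until all of them have set their flag, and
 * return how many flags were observed set.  Only after this wait would the
 * leader begin a scan that periodically resets its own snapshot.
 */
int
demo_run(int nworkers)
{
	DemoShared	shared;
	pthread_t	threads[DEMO_MAX_WORKERS];
	DemoWorkerArg args[DEMO_MAX_WORKERS];
	int			restored = 0;

	assert(nworkers >= 0 && nworkers <= DEMO_MAX_WORKERS);

	/* Mirrors clearing the flags before the workers are launched. */
	for (int i = 0; i < nworkers; i++)
		atomic_init(&shared.snapshot_restored[i], false);

	for (int i = 0; i < nworkers; i++)
	{
		args[i].shared = &shared;
		args[i].worker_number = i;
		if (pthread_create(&threads[i], NULL, demo_worker_main, &args[i]) != 0)
			abort();
	}

	/* Poll until every worker has imported its initial snapshot. */
	while (!demo_all_restored(&shared, nworkers))
		sched_yield();

	for (int i = 0; i < nworkers; i++)
		if (atomic_load(&shared.snapshot_restored[i]))
			restored++;

	for (int i = 0; i < nworkers; i++)
		pthread_join(threads[i], NULL);

	return restored;
}
```

Calling demo_run(4) starts four workers and returns once all four have
reported. The patch itself folds the check into the existing attach loop of
WaitForParallelWorkersToAttach() rather than adding a separate polling loop;
the sketch only shows the ordering constraint.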
 src/backend/access/brin/brin.c                | 49 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 196 insertions(+), 67 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c21608a6fd8..e580483a7cb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1244,7 +1243,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1259,6 +1257,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2359,7 +2358,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2390,25 +2388,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2448,8 +2446,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2474,7 +2470,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2520,7 +2517,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2536,6 +2532,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2544,7 +2547,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2567,9 +2571,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2769,14 +2770,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2798,6 +2799,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2938,6 +2940,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 580ec7f9aa8..3b3cbe571ac 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that; for a non-unique index,
+	 * that snapshot is reset every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b490da0eeee..810f80fc8e6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e18a8f8250f..b5b7be60a5e 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the
+	 * scan accordingly and use the active snapshot as the
+	 * initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 7817bedc2ef..e9c0a46fd78 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Establish dynamic shared memory to pass information about whether
+		 * each worker has imported its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1495,6 +1533,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8c6dfecf515..707ff39ef40 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index fa2d522b25f..ef4d0ae2fab 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3d018c3a1e8..4cd536e988c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 8811618acb7..f5cae39c85f 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index dc6e0184284..8529b808aed 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ec8928ad90b..9a9b094f3f1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1180,7 +1180,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1798,9 +1799,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. This changes snapshots on the fly so that the xmin horizon
+ * can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v14-0009-Concurrently-built-index-validation-uses-fresh-s.patch (14.1K, 8-v14-0009-Concurrently-built-index-validation-uses-fresh-s.patch)
  download | inline diff:
From 260ccd0b439c99141c9366acbff9d7fc72882404 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 17:21:29 +0100
Subject: [PATCH v14 09/12] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach held a single reference snapshot for the entire validation scan. If the index build took a long time, that snapshot pinned the backend's xmin for the whole duration, so the xmin horizon could not advance and VACUUM could not remove tuples that became dead after the snapshot was taken.

To address this, the validation process now periodically replaces the snapshot with a newer one. This lets the backend's xmin advance during validation, while all relevant tuples are still included in the index.
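
As a rough illustration of the refresh cadence described above, the following standalone C sketch models the loop outside PostgreSQL. The `Xid` type, `xid_newer`, `validate_scan_sketch`, and the value of `RESET_EACH_N_PAGES` are stand-ins invented for this example (the patch itself uses `TransactionId`, `TransactionIdNewer`, and `SO_RESET_SNAPSHOT_EACH_N_PAGE`). Snapshots are reduced to a bare xmin value, and the special handling of permanent XIDs in the real comparison is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;				/* stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)		/* stand-in for InvalidTransactionId */
#define RESET_EACH_N_PAGES 4		/* hypothetical value, small for the demo */

/* Modulo-2^32 comparison: true if a is logically newer than b. */
static bool
xid_follows(Xid a, Xid b)
{
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two XIDs, treating INVALID_XID as "no value". */
static Xid
xid_newer(Xid a, Xid b)
{
	if (a == INVALID_XID)
		return b;
	if (b == INVALID_XID)
		return a;
	return xid_follows(a, b) ? a : b;
}

/*
 * Model of the validation scan: xmin_at_page[i] is the xmin a snapshot
 * would have if it were taken right after page i was processed.
 * Returns the final limit xmin; *nrefreshes counts snapshot replacements.
 */
static Xid
validate_scan_sketch(Xid initial_xmin, const Xid *xmin_at_page,
					 int npages, int *nrefreshes)
{
	Xid		limit_xmin = initial_xmin;
	int		page_read_counter = 1;	/* mirrors the patch: starts at 1 */

	*nrefreshes = 0;
	for (int page = 0; page < npages; page++)
	{
		page_read_counter++;

		/* ... tuples on this page would be checked here ... */

		if (page_read_counter % RESET_EACH_N_PAGES == 0)
		{
			/* drop the old snapshot, take a fresh one */
			Xid		fresh_xmin = xmin_at_page[page];

			/* xmin should not go backwards, but guard against it anyway */
			limit_xmin = xid_newer(limit_xmin, fresh_xmin);
			(*nrefreshes)++;
		}
	}
	return limit_xmin;
}
```

With a reset interval of 4 and ten pages, this sketch replaces the snapshot twice (after the pages at index 2 and 6), and the returned limit is the newest xmin seen across all snapshots.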
---
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/heapam_handler.c | 77 +++++++++++++++---------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 14 +++--
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 +++++
 8 files changed, 97 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6a05620bd67..64c633e0398 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -495,10 +495,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index efa42064c7a..fb513774c0d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1791,8 +1791,8 @@ heapam_index_build_range_scan(Relation heapRelation,
  */
 static int
 heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
-										   Tuplesortstate  *aux,
-										   Tuplestorestate *store)
+									  Tuplesortstate  *aux,
+									  Tuplestorestate *store)
 {
 	int				num = 0;
 	/* state variables for the merge */
@@ -2048,7 +2048,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2059,9 +2060,35 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
-	 * Now take the snapshot that will be used by to filter candidate
-	 * tuples.
+	 * sanity checks
+	 */
+	Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+															  auxState->tuplesort,
+															  tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
 	 *
 	 * Beware!  There might still be snapshots in use that treat some transaction
 	 * as in-progress that our temporary snapshot treats as committed.
@@ -2077,33 +2104,10 @@ heapam_index_validate_scan(Relation heapRelation,
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
 	limitXmin = snapshot->xmin;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
-	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
-
-	/*
-	 * sanity checks
-	 */
-	Assert(OidIsValid(indexRelation->rd_rel->relam));
-
-	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
-														 auxState->tuplesort,
-														 tuples_for_check);
-
-	/* It is our responsibility to sloe tuple sort as fast as we can */
-	tuplesort_end(state->tuplesort);
-	tuplesort_end(auxState->tuplesort);
-
-	state->tuplesort = auxState->tuplesort = NULL;
-
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2140,6 +2144,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2194,6 +2199,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+
+		if (page_read_counter % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8b236c8ccd6..62e975016ad 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 894aefa19e1..6a6b1f8797b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -190,14 +190,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -811,7 +813,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -925,6 +926,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -965,6 +970,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3a89d18505c..1943dd46243 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3477,8 +3477,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3491,7 +3492,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3574,6 +3575,7 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3609,6 +3611,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3638,9 +3643,6 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	}
 	tuplesort_performsort(state.tuplesort);
 	tuplesort_performsort(auxState.tuplesort);
-
-	PopActiveSnapshot();
-	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cd0d63ded82..e10f6098f58 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4354,7 +4354,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 0cab8653f1b..3d8db998c0b 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0



  [application/octet-stream] v14-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.0K, 9-v14-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
  download | inline diff:
From 58c6c83c35bb44161c4500995ad413fce0938b3c Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v14 06/12] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
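
To make the "temporarily storing TIDs" idea concrete, here is a minimal standalone C sketch of an append-only TID store. It is not STIR's actual page layout: the `Tid`, `TidPage`, and `TidStore` types, the `tid_page_add` and `tid_store_insert` functions, and the tiny `TIDS_PER_PAGE` and `MAX_PAGES` limits are all hypothetical, and buffers, locking, and the metapage are left out. Only the `skip_inserts` flag and the "append to the last page, move to a new page when full" behavior are modeled on the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TIDS_PER_PAGE 4		/* hypothetical, tiny for illustration */
#define MAX_PAGES     8		/* hypothetical fixed capacity */

/* Simplified heap tuple identifier: block number plus line offset. */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* A page holds TIDs in insertion order; nothing is ever searched. */
typedef struct
{
	Tid			tuples[TIDS_PER_PAGE];
	uint16_t	ntuples;
} TidPage;

typedef struct
{
	TidPage		pages[MAX_PAGES];
	uint32_t	last_page;		/* page currently receiving appends */
	bool		skip_inserts;	/* when set, inserts are silently refused */
} TidStore;

/* Append to the end of a page; returns false if the TID does not fit. */
static bool
tid_page_add(TidPage *page, Tid tid)
{
	if (page->ntuples >= TIDS_PER_PAGE)
		return false;
	page->tuples[page->ntuples++] = tid;
	return true;
}

/* Append to the store, moving on to a fresh page when the last one is full. */
static bool
tid_store_insert(TidStore *store, Tid tid)
{
	if (store->skip_inserts)
		return false;
	if (tid_page_add(&store->pages[store->last_page], tid))
		return true;
	if (store->last_page + 1 >= MAX_PAGES)
		return false;			/* out of space in this toy model */
	store->last_page++;
	return tid_page_add(&store->pages[store->last_page], tid);
}
```

A store must be zero-initialized before use (for example with `memset`). Because the structure only ever appends, reading it back is a plain sequential walk over the pages, which fits the bulk-delete-style callback the validation phase relies on.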
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 09fab08b8e1..aaf55d689d2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..b844bcb21d7
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage =  BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		if (metaData->skipInserts)	/* inserts are not allowed */
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK); /* not held during insert */
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d937ba65c9c..2dbf8f82141 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3403,6 +3403,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2a7769b1fd1..f27d9041e2c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0d92e694d6a..a39d36c3539 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 6b66bc18286..694a2518ba5 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1be8739573f..44f8a0d5606 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 43445cdcc6c..26ddd5ec577 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b37e8a6f882..5ea2b12bf0a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b3f7aa299f5..7bfe0acb91c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
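
For readers following the page layout in the patch above, the free-space
arithmetic behind StirPageGetFreeSpace and the fit check in StirPageAddItem
can be sketched standalone. This is only an illustration, not code from the
patch: every SKETCH_/sketch_ name is hypothetical, and the 8192-byte block
size, 24-byte page header and 8-byte MAXALIGN are assumed build defaults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed build defaults, not taken from the patch. */
#define SKETCH_BLCKSZ			8192
#define SKETCH_PAGE_HEADER_SIZE	24
#define SKETCH_MAXALIGN(len)	(((size_t) (len) + 7) & ~(size_t) 7)

/* Stand-in for StirTuple: a single 6-byte heap TID. */
typedef struct SketchStirTuple
{
	uint16_t	blk_hi;
	uint16_t	blk_lo;
	uint16_t	offset;
} SketchStirTuple;

/* Stand-in for StirPageOpaqueData: four uint16 fields, 8 bytes. */
typedef struct SketchStirOpaque
{
	uint16_t	maxoff;
	uint16_t	flags;
	uint16_t	unused;
	uint16_t	page_id;
} SketchStirOpaque;

/* Bytes usable for tuples on an empty page. */
static size_t
sketch_usable_bytes(void)
{
	return SKETCH_BLCKSZ
		- SKETCH_MAXALIGN(SKETCH_PAGE_HEADER_SIZE)
		- SKETCH_MAXALIGN(sizeof(SketchStirOpaque));
}

/* Same formula as StirPageGetFreeSpace, for a page holding maxoff tuples. */
static size_t
sketch_free_space(size_t maxoff)
{
	size_t		used = maxoff * sizeof(SketchStirTuple);

	assert(used <= sketch_usable_bytes());
	return sketch_usable_bytes() - used;
}

/* Same test as the fit check at the top of StirPageAddItem. */
static bool
sketch_tuple_fits(size_t maxoff)
{
	return sketch_free_space(maxoff) >= sizeof(SketchStirTuple);
}
```

Under these assumptions an empty page offers 8192 - 24 - 8 = 8160 bytes,
so it holds 8160 / 6 = 1360 tuples before a new page has to be added.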



  [application/octet-stream] v14-0007-tuplestore-add-support-for-storing-Datum-values.patch (17.3K, 10-v14-0007-tuplestore-add-support-for-storing-Datum-values.patch)
  download | inline diff:
From 812b58f910119ccec6c0023fa0f7a89d49c07867 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v14 07/12] tuplestore: add support for storing Datum values

Add ability to store and retrieve individual Datum values in tuplestore, optimizing storage based on type:

- Fixed-length: stores raw bytes without length prefix
- Variable-length: includes length prefix/suffix
- By-value types handled inline

This extends tuplestore beyond just handling tuples, planned to be used in next patch.
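
The storage rules listed above can be illustrated with a small size
calculation. This is a hedged sketch, not code from the patch:
sketch_datum_tape_size is a hypothetical helper, and it assumes that
variable-length Datums follow the existing tuple convention of a leading
"unsigned int" length word plus a trailing one only when backward scans
are enabled.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patch): bytes one stored Datum
 * would occupy on tape.
 *
 * typlen > 0 : fixed-length type, raw bytes only, no length words
 * otherwise  : variable-length, leading length word plus, when backward
 *              scans are enabled, a trailing one (assumed convention)
 */
static size_t
sketch_datum_tape_size(int typlen, size_t payload_len, bool backward)
{
	size_t		total;

	if (typlen > 0)
		return (size_t) typlen;

	total = sizeof(unsigned int) + payload_len;
	if (backward)
		total += sizeof(unsigned int);
	return total;
}
```

For a fixed-length type the result is just typlen, which is why the reader
can take the length from the type instead of from the tape.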
---
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index aacec8b7993..4ed13da6046 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that Datum type */
+	bool		datumTypeByVal; /* typbyval of that Datum type */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple from tape. Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * In the case of tuples, the first "unsigned int" of a stored tuple is
+ * the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * In the case of a constant-length Datum, both "unsigned int" words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * 1024L;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);	/* datum holds the value itself */
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;	/* same width as lentup_datum reads */
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index ed7c454f44e..1f431863387 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
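
The on-disk layout described in the "Routines specialized for Datum case" comment of the patch above can be illustrated outside PostgreSQL. The following is only a minimal sketch under stated assumptions: `SketchBuf` and every `sketch_*` function are hypothetical stand-ins for `BufFile` and the `writetup`/`readtup` callbacks, not part of the patch or of any PostgreSQL API. It shows a fixed-length value written as raw bytes with no length word, and a variable-length value written with a leading length word plus a trailing one when backward scans are needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the temporary file (not BufFile). */
typedef struct SketchBuf
{
	unsigned char data[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} SketchBuf;

static void
sketch_write(SketchBuf *buf, const void *src, size_t n)
{
	assert(buf->wpos + n <= sizeof(buf->data));
	memcpy(buf->data + buf->wpos, src, n);
	buf->wpos += n;
}

static void
sketch_read(SketchBuf *buf, void *dst, size_t n)
{
	assert(buf->rpos + n <= buf->wpos);
	memcpy(dst, buf->data + buf->rpos, n);
	buf->rpos += n;
}

/* Fixed-length value: raw bytes only, no length word. */
static void
sketch_put_fixed(SketchBuf *buf, int64_t value)
{
	sketch_write(buf, &value, sizeof(value));
}

static int64_t
sketch_get_fixed(SketchBuf *buf)
{
	int64_t		value;

	sketch_read(buf, &value, sizeof(value));
	return value;
}

/* Variable-length value: length word, payload, optional trailing length. */
static void
sketch_put_varlen(SketchBuf *buf, const void *payload, unsigned int len,
				  int backward)
{
	sketch_write(buf, &len, sizeof(len));
	sketch_write(buf, payload, len);
	if (backward)
		sketch_write(buf, &len, sizeof(len));
}

/* Reads one variable-length value into dst and returns its length. */
static unsigned int
sketch_get_varlen(SketchBuf *buf, void *dst, unsigned int dstcap,
				  int backward)
{
	unsigned int len;

	sketch_read(buf, &len, sizeof(len));
	assert(len <= dstcap);
	sketch_read(buf, dst, len);
	if (backward)
	{
		unsigned int trailer;

		sketch_read(buf, &trailer, sizeof(trailer));
		assert(trailer == len);
	}
	return len;
}
```

The point of the sketch is that the writer and the reader must agree on the width of the length word and on whether it is present at all; in the patch that agreement is spread across lentup_datum, writetup_datum and readtup_datum.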



  [application/octet-stream] v14-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (109.9K, 11-v14-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
  download | inline diff:
From 854d020c0ed3db18c29f7f3a6e7a0848b90b3842 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v14 08/12] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
 doc/src/sgml/monitoring.sgml                  |  26 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/README.HOT            |  15 +-
 src/backend/access/heap/heapam_handler.c      | 591 ++++++++++++------
 src/backend/catalog/index.c                   | 312 +++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 ++++++++---
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  31 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 +
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 21 files changed, 1193 insertions(+), 402 deletions(-)
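
The "merging the auxiliary index content with the main index" step from the commit message amounts to a set difference over two sorted TID streams. The following is a minimal standalone sketch of that idea, assuming TIDs are already encoded as sortable 64-bit integers; `sorted_difference` and its array-based interface are hypothetical and replace the tuplesort/tuplestore machinery the patch actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy into 'out' every element of the sorted array
 * 'aux_tids' that does not appear in the sorted array 'main_tids'.
 * 'out' must have room for 'naux' elements.  Returns the number of
 * elements written.
 */
static size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
				  const int64_t *aux_tids, size_t naux,
				  int64_t *out)
{
	size_t		i = 0;			/* cursor into main_tids */
	size_t		nout = 0;

	assert(out != NULL || naux == 0);

	for (size_t j = 0; j < naux; j++)
	{
		/* Advance the main cursor until it reaches or passes aux_tids[j]. */
		while (i < nmain && main_tids[i] < aux_tids[j])
			i++;

		/* Main is exhausted or already past: the TID exists only in aux. */
		if (i == nmain || main_tids[i] > aux_tids[j])
			out[nout++] = aux_tids[j];
	}

	return nout;
}
```

Each input is consumed once, so the cost is linear in the combined size of both sorted inputs, which is why no second heap scan is needed to find the missing TIDs.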

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d0d176cc54f..cf7a3bf5271 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6202,6 +6202,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, so that all later transactions insert new tuples
+       into the auxiliary index.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6242,13 +6254,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6265,8 +6276,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> additionally waits before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..6a05620bd67 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,11 +368,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any missing ones that are visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bc3d3738ede..efa42064c7a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,450 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation that works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Now take the snapshot that will be used to filter candidate
+	 * tuples.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is returned for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
+
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
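A minimal, standalone sketch of the merge step that the validation code above relies on: given two ascending arrays of encoded TIDs, collect every auxiliary-index TID that is missing from the target index. The names tid_encode and sorted_difference are hypothetical and exist only for this illustration. The patch itself works on tuplesort and tuplestore objects rather than plain arrays, and the encoding here is only assumed to mirror itemptr_encode (block number in the high bits, offset in the low 16 bits). Duplicate TIDs in the auxiliary input are not handled specially.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for itemptr_encode(): block in the high bits,
 * offset in the low 16 bits, so integer order matches TID order.
 */
static int64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Copy into out[] every element of aux[] that does not occur in target[].
 * Both inputs must be sorted ascending; out[] needs room for naux entries.
 * Returns the number of elements written.
 */
static size_t
sorted_difference(const int64_t *target, size_t ntarget,
				  const int64_t *aux, size_t naux,
				  int64_t *out)
{
	size_t		t = 0;
	size_t		n = 0;

	for (size_t a = 0; a < naux; a++)
	{
		/* Advance the target cursor until it reaches or passes aux[a]. */
		while (t < ntarget && target[t] < aux[a])
			t++;

		/* Target exhausted or overshot: aux[a] is missing from it. */
		if (t == ntarget || target[t] > aux[a])
			out[n++] = aux[a];
	}
	return n;
}
```

Because both inputs are consumed strictly left to right, the difference costs a single pass over each sorted list, which is why the sketch needs no per-page "look back" state.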
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2dbf8f82141..3a89d18505c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +748,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +760,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +797,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1407,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1462,7 +1473,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1484,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2467,7 +2628,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2527,7 +2689,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3276,12 +3439,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that we build the auxiliary index.  This is a fast operation that
+ * does not scan the table, so it leaves us with an empty STIR index.  We
+ * commit the transaction and again wait for all transactions that could have
+ * been modifying the table to terminate.  From that moment on, all new tuples
+ * are inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3291,18 +3463,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3310,12 +3485,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3333,22 +3510,27 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, some actions to concurrently drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3381,12 +3563,16 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
 
 	/* mark build is concurrent just for consistency */
@@ -3405,15 +3591,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3436,27 +3637,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3465,8 +3672,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3525,6 +3736,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3796,6 +4012,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4038,6 +4261,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4063,6 +4287,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7a595c84db9..0e4d977db87 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1265,16 +1265,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5921dcf68a1..cd0d63ded82 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -554,6 +557,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -563,6 +567,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -584,10 +589,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -834,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select a name for the auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -929,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1227,7 +1242,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1569,6 +1585,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1597,11 +1623,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1611,7 +1637,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1650,7 +1676,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1688,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now build the auxiliary index. It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts". The goal of that index is
+	 * to accumulate new tuples while the main index is actually being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1698,43 +1748,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge the contents of the auxiliary and target indexes: insert any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1757,12 +1795,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1787,6 +1825,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3542,6 +3627,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3647,8 +3733,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3700,8 +3793,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3762,6 +3862,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3865,15 +3972,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3924,6 +4034,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4052,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3951,6 +4071,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3969,10 +4090,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4178,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because it creates an empty index without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4102,24 +4269,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts". So,
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4134,13 +4329,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4152,16 +4340,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4361,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4271,14 +4451,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4303,6 +4483,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4316,11 +4518,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4542,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 694a2518ba5..4af3d3f7455 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -784,7 +784,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -800,6 +800,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -825,7 +826,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index d69baaa364f..e2c0fc8fd66 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1862,22 +1862,25 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both state and auxstate.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..01f85e57ea2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 18e3179ef63..4c3ea686494 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -92,14 +92,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7bfe0acb91c..8ab74e2b1d9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -177,8 +177,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 8011c141bf8..34331e4d48b 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3028,6 +3029,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3040,8 +3042,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3069,6 +3073,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3014d047fef..e0a46c0a42a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2013,14 +2013,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 068c66b95a5..b410fa5c541 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1244,10 +1245,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1259,6 +1262,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
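
For readers following the renumbering in progress.h and rules.out above, here is a minimal standalone sketch of the new phase-number-to-label mapping. The numbers and label strings are copied from those two hunks. The function name createidx_phase_label is hypothetical and exists only for this illustration, and the "building index" entry omits the AM-specific subphase suffix that the view appends.

```c
#include <stddef.h>

/*
 * Illustration only: maps PROGRESS_CREATEIDX_PHASE values, as renumbered
 * by the patch above, to the labels shown in pg_stat_progress_create_index.
 * createidx_phase_label() is a hypothetical helper, not a PostgreSQL API.
 */
static const char *
createidx_phase_label(int phase)
{
	static const char *const labels[] = {
		"initializing",									/* 0 */
		"waiting for writers before build",				/* WAIT_1 */
		"waiting for writers to use auxiliary index",	/* WAIT_2 (new) */
		"building index",								/* BUILD */
		"waiting for writers before validation",		/* WAIT_3 */
		"index validation: scanning index",				/* VALIDATE_IDXSCAN */
		"index validation: sorting tuples",				/* VALIDATE_SORT */
		"index validation: merging indexes",			/* VALIDATE_IDXMERGE */
		"waiting for old snapshots",					/* WAIT_4 */
		"waiting for readers before marking dead",		/* WAIT_5 */
		"waiting for readers before dropping",			/* WAIT_6 (new) */
	};
	int			nlabels = (int) (sizeof(labels) / sizeof(labels[0]));

	/* Out-of-range values map to NULL, like the ELSE branch of the view. */
	if (phase < 0 || phase >= nlabels)
		return NULL;
	return labels[phase];
}
```

Values 2 through 7 change meaning relative to the old numbering, so any external tool that stored the raw phase numbers (rather than the view's text) would presumably need a mapping along these lines.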



  [application/octet-stream] v14-0010-Remove-PROC_IN_SAFE_IC-optimization.patch (20.6K, 12-v14-0010-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 89161b74ea0470f8f030c0e09d15cc4cfc5adeb0 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v14 10/12] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 8 files changed, 11 insertions(+), 233 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e580483a7cb..b4b36bda018 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2885,11 +2885,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 62e975016ad..1eb4299826e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e10f6098f58..b98851a9e35 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -116,7 +116,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -419,10 +418,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -443,8 +439,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -464,8 +459,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -579,7 +573,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1157,10 +1150,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1647,10 +1636,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1705,9 +1690,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1737,10 +1719,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1766,9 +1744,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1785,9 +1761,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1828,10 +1801,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1852,10 +1821,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3630,7 +3595,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4002,17 +3966,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4072,7 +4025,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4165,11 +4117,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4200,10 +4147,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4212,11 +4155,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4241,10 +4179,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4264,11 +4198,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4289,10 +4218,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4325,10 +4250,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4356,9 +4277,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4380,13 +4298,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4442,12 +4353,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4509,12 +4414,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4774,36 +4673,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 20777f7d5ae..4bd24bc02d4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 2225cd0bf87..b257a0344a8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc cic_reset_snapshots
+REGRESS = injection_points cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index fb131270668..051b3e789c1 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -34,7 +34,6 @@ tests += {
   'regress': {
     'sql': [
       'injection_points',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0
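
For readers skimming the hunks above: the patch drops PROC_IN_SAFE_IC from the flag mask passed to GetCurrentVirtualXIDs() in WaitForOlderSnapshots(), so backends running a "safe" concurrent index build are no longer skipped. The following is only a toy model of that filtering, not the procarray code. ToyBackend and toy_count_waited() are hypothetical names invented for illustration; only the three flag values come from the proc.h hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as they appear in the proc.h hunk above. */
#define PROC_IS_AUTOVACUUM	0x01
#define PROC_IN_VACUUM		0x02
#define PROC_IN_SAFE_IC		0x04	/* removed by the patch */

/* Hypothetical stand-in for the state one backend advertises. */
typedef struct ToyBackend
{
	uint8_t		status_flags;
	int			has_old_snapshot;	/* nonzero if its xmin precedes the limit */
} ToyBackend;

/*
 * Count the backends a waiter would have to wait for: those holding an
 * old snapshot whose flags do not intersect exclude_mask.
 */
static size_t
toy_count_waited(const ToyBackend *backends, size_t n, uint8_t exclude_mask)
{
	size_t		count = 0;

	assert(backends != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (!backends[i].has_old_snapshot)
			continue;			/* nothing to wait for */
		if (backends[i].status_flags & exclude_mask)
			continue;			/* explicitly ignored by the waiter */
		count++;
	}
	return count;
}
```

In this model, a backend carrying only PROC_IN_SAFE_IC is skipped when the mask includes that bit and is waited for once the bit is removed from the mask, which is the behavioural change the hunks make.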



  [application/octet-stream] v14-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 13-v14-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From ba3602eca96f62328b966bcecb4d8afc57edca56 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v14 11/12] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, because of an index
constraint violation), such indexes are left behind as junk indexes that
serve no purpose. This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
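
As a reading aid for the get_auxiliary_index() lookup that this patch adds to pg_depend.c, here is a minimal, self-contained sketch of the matching rule it applies: among the rows that reference the main index, pick the first one that is an AUTO dependency held by an index. It is a toy model over an in-memory array, not the catalog scan itself. ToyOid, ToyDepend, the TOY_* constants and toy_get_auxiliary_index() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Oid, InvalidOid, deptype and relkind. */
typedef unsigned int ToyOid;
#define TOY_INVALID_OID 0u

typedef enum { TOY_DEP_NORMAL, TOY_DEP_AUTO } ToyDepType;
typedef enum { TOY_KIND_TABLE, TOY_KIND_INDEX } ToyRelKind;

/* One simplified dependency row: "objid depends on refobjid". */
typedef struct ToyDepend
{
	ToyOid		objid;			/* dependent object */
	ToyRelKind	objkind;		/* kind of the dependent object */
	ToyOid		refobjid;		/* referenced object */
	ToyDepType	deptype;
} ToyDepend;

/*
 * Return the first index that holds an AUTO dependency on index_id,
 * or TOY_INVALID_OID if no such row exists.
 */
static ToyOid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, ToyOid index_id)
{
	assert(deps != NULL || ndeps == 0);

	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == index_id &&
			deps[i].deptype == TOY_DEP_AUTO &&
			deps[i].objkind == TOY_KIND_INDEX)
			return deps[i].objid;
	}
	return TOY_INVALID_OID;
}
```

The rule relies on the AUTO dependency that index_create() records when auxiliaryForIndexId is valid. The same dependency is what lets DROP INDEX on the main index remove the ccaux index along with it.
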
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 64c633e0398..c6db5d57167 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 096b68c7f39..1c2cfc94b54 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1943dd46243..b7d42c6965f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -687,6 +687,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -733,6 +735,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -775,6 +778,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be given together or not at all */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1176,6 +1181,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1458,6 +1472,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1608,6 +1623,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3824,6 +3840,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3880,6 +3897,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4168,7 +4198,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4257,13 +4288,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4289,18 +4337,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index held by a
+		 * relation of relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b98851a9e35..ab6dbd32d9f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1224,7 +1224,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3593,6 +3593,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3941,6 +3942,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3948,6 +3950,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4010,12 +4013,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Look up any junk auxiliary index of the reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4025,6 +4033,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4045,10 +4054,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4205,7 +4222,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including the junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4224,6 +4242,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4406,6 +4427,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4451,6 +4474,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4181c110eb7..e9b6ded6a55 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1492,6 +1492,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1552,9 +1554,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1606,6 +1619,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop is required to be the first transaction. But
+		 * if we have a junk auxiliary index, we want to drop it too (also in a
+		 * concurrent way). In that case, silently perform an internal deletion
+		 * of the auxiliary index and restore the transaction state. It is fine to
+		 * do it in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1634,12 +1675,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 01f85e57ea2..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 34331e4d48b..d858545dba3 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3096,20 +3096,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index b410fa5c541..95e6f72fd4c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1273,11 +1273,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v14-0012-Updates-index-insert-and-value-computation-logic.patch (2.2K, 14-v14-0012-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From bea49e40e16a156e7c942f1fdc9afb0237ec2d1f Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v14 12/12] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since they are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b7d42c6965f..26ef4dfea27 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2929,6 +2929,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ae11c1dd463..d070f80795d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -434,11 +434,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0
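
The two shortcuts in v14-0012 can be illustrated outside of PostgreSQL with a
small standalone sketch. It is not the patch code: SketchIndexInfo,
sketch_form_datum, sketch_unchanged_check and sketch_index_unchanged are
hypothetical stand-ins for IndexInfo, FormIndexDatum, index_unchanged_by_update
and the expression in ExecInsertIndexTuples. It only assumes what the commit
message states: an auxiliary index gets all-NULL values, and the "unchanged"
check is never evaluated for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two IndexInfo fields used below. */
typedef struct SketchIndexInfo
{
	int			num_attrs;
	bool		auxiliary;
} SketchIndexInfo;

/* Counts how often the (pretend) expensive check was evaluated. */
static int	unchanged_checks = 0;

/* Fill values/isnull; an auxiliary index gets NULLs without reading the row. */
static void
sketch_form_datum(const SketchIndexInfo *info, const long *row,
				  long *values, bool *isnull)
{
	assert(info != NULL);

	for (int i = 0; i < info->num_attrs; i++)
	{
		if (info->auxiliary)
		{
			values[i] = 0;
			isnull[i] = true;
		}
		else
		{
			values[i] = row[i];
			isnull[i] = false;
		}
	}
}

/* Stand-in for the expensive per-index comparison; always answers "yes". */
static bool
sketch_unchanged_check(void)
{
	unchanged_checks++;
	return true;
}

/* Short-circuit: the expensive check is skipped for auxiliary indexes. */
static bool
sketch_index_unchanged(const SketchIndexInfo *info, bool update)
{
	return update && !info->auxiliary && sketch_unchanged_check();
}
```

Because && evaluates left to right and stops at the first false operand, the
order of the operands is what keeps the check from running for an auxiliary
index.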



  [image/png] bench.png (44.9K, 15-bench.png)


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-02-20 14:56  Mihail Nikalayeu <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Mihail Nikalayeu @ 2025-02-20 14:56 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone.

Just rebased.

Also, here is the Discord thread:
https://discordapp.com/channels/1258108670710124574/1259884843165155471/1334565506149253150


Attachments:

  [text/x-patch] v15-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.0K, 3-v15-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From 71beb7388cd35015d7d1b2f7cfc550f075afb8d3 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v15 05/12] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
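
The second check can be pictured as a single pass over the sorted stream. The
sketch below is a simplified, standalone illustration and not the _bt_load
code: SketchTuple and has_alive_duplicate are hypothetical names, keys are
plain integers, and liveness is a stored flag, whereas the patch looks it up in
the heap for each tuple of an equal-key group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sorted index tuple: a key plus liveness. */
typedef struct SketchTuple
{
	int			key;
	bool		alive;
} SketchTuple;

/*
 * Return true if any run of equal keys in the sorted input contains more
 * than one alive tuple. Dead tuples in the run are ignored.
 */
static bool
has_alive_duplicate(const SketchTuple *tuples, size_t ntuples)
{
	bool		seen_alive = false;

	for (size_t i = 0; i < ntuples; i++)
	{
		/* The input must already be sorted by key. */
		assert(i == 0 || tuples[i - 1].key <= tuples[i].key);

		/* A new key starts a new group, so forget the previous group. */
		if (i == 0 || tuples[i].key != tuples[i - 1].key)
			seen_alive = false;

		if (tuples[i].alive)
		{
			if (seen_alive)
				return true;	/* second alive tuple with the same key */
			seen_alive = true;
		}
	}
	return false;
}
```

For a group shaped like "addda" (alive, dead, dead, dead, alive) the function
reports a duplicate, while a group with a single alive tuple among dead ones
passes.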
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 263 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2a617a05f8c..76837203601 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index cbe73675f86..5db6d237c2c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 810f80fc8e6..8b236c8ccd6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * It is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee the absence of multiple alive
+	 * tuples with the same values in the spool. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not done yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 693e43c674b..f9695fba8b5 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -51,8 +51,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2828,7 +2826,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -2946,17 +2944,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as comparison function during concurrent
+ * unique index build in case _bt_keep_natts_fast is not suitable because
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true if either tuple has a null column.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -2982,6 +2987,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -3001,7 +3008,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -3012,7 +3019,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. Also, it may be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -3021,6 +3029,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true if either tuple has a null column.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -3029,7 +3039,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -3046,6 +3057,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index dacff9605ad..53be1269aff 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3296,9 +3296,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 9eddeb93338..5ebc50831be 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1700,8 +1700,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..0b25926bc56 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -30,6 +30,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -123,6 +124,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +351,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +394,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1524,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1534,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 000c7289b80..ac7abbf8fc5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1314,8 +1314,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 313394d92c6..b1920999f12 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1800,9 +1800,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * That leads to changing snapshots on the fly to allow the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..76131b6f2e1 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [text/x-patch] v15-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (17.5K, 4-v15-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From 561e06110de17c3a3d95ea137e57b10c29657f30 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v15 01/12] This is https://commitfest.postgresql.org/50/5160/
 merged into a single commit. It is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 6 files changed, 216 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f8d3ea820e1..47c509ceb3e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1796,6 +1796,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4201,7 +4202,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4280,6 +4281,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 742f3f8c08d..f2a74b76465 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -943,6 +944,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 432eeaf9034..44c0e8ed285 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -487,6 +487,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -697,6 +739,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -707,23 +751,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index b0fe50075ad..d5ad73f6f69 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1158,6 +1159,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 71abb01f655..af7586a428f 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in list order so the lock order stays the same, avoiding deadlocks.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint is used, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* It is required to have at least one indisvalid index during the planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8f1508b1ee2..3d018c3a1e8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0



  [text/x-patch] v15-0002-Add-stress-tests-for-concurrent-index-operations.patch (8.0K, 5-v15-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From afd797e7a0731ce1b35511d2e8724b20a72e413e Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v15 02/12] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 189 ++++++++++++++++++++++++++++++++
 2 files changed, 190 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..a9559dbe3af
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,189 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN/GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 4)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIN (ia);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIST (p);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING BRIN (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING HASH (updated_at);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [text/x-patch] v15-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch (43.7K, 6-v15-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch)
  download | inline diff:
From 203606ebbeb826420cba6494a0a72a6bc9b8d69e Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v15 03/12] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
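
As a rough illustration of the mechanism described above (not the patch's code), the following toy model shows how taking a fresh snapshot every N pages lets the scan's xmin follow concurrent activity instead of staying pinned at its starting value. Everything here is hypothetical: ToyScan, ToySnapshot, the toy_* functions, the interval of 4 pages, and the assumption that exactly one concurrent transaction completes per page. The reset condition mirrors the block-number modulo check used in heap_fetch_next_buffer, so block 0 also triggers a reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interval; stands in for SO_RESET_SNAPSHOT_EACH_N_PAGE. */
#define TOY_RESET_EACH_N_PAGE 4

/* Toy snapshot: only remembers the xmin it was taken at. */
typedef struct ToySnapshot
{
	unsigned	xmin;
} ToySnapshot;

/* Toy scan state; reset_snapshot plays the role of the SO_RESET_SNAPSHOT flag. */
typedef struct ToyScan
{
	bool		reset_snapshot;
	ToySnapshot snapshot;
	unsigned	next_xid;		/* simulated "next transaction id" counter */
	int			resets;			/* number of snapshot resets performed */
} ToyScan;

/* Take a snapshot at the current simulated xid. */
static ToySnapshot
toy_take_snapshot(const ToyScan *scan)
{
	ToySnapshot s = {scan->next_xid};

	return s;
}

/* Process one page; replace the snapshot on every N-th block if enabled. */
static void
toy_fetch_page(ToyScan *scan, unsigned blockno)
{
	if (scan->reset_snapshot && blockno % TOY_RESET_EACH_N_PAGE == 0)
	{
		scan->snapshot = toy_take_snapshot(scan);
		scan->resets++;
	}
	/* Assumption: one concurrent transaction completes per scanned page. */
	scan->next_xid++;
}

/* Scan npages pages starting at start_xid; return the number of resets. */
static int
toy_scan(ToyScan *scan, unsigned npages, unsigned start_xid)
{
	assert(scan != NULL);
	scan->next_xid = start_xid;
	scan->resets = 0;
	scan->snapshot = toy_take_snapshot(scan);
	for (unsigned blockno = 0; blockno < npages; blockno++)
		toy_fetch_page(scan, blockno);
	return scan->resets;
}
```

In this model, calling toy_scan() over 10 pages from xid 100 with reset_snapshot set leaves the snapshot's xmin at the value taken at block 8, while the same scan without the flag keeps the xmin it started with for the whole scan.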

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  16 +++
 src/backend/access/gin/gininsert.c            |   3 +
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 407 insertions(+), 34 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index aac8c74f546..63a08fbe615 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 60320440fc5..f1dba9e8185 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1228,6 +1228,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1247,6 +1248,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2370,6 +2372,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2395,9 +2398,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2440,6 +2450,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2519,6 +2531,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2535,6 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d1b5e8f0dd1..a5184e7d89d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -21,6 +21,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -375,6 +376,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	/*
 	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
 	 * prefers to receive tuples in TID order.
@@ -423,6 +425,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	return result;
 }
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 02ec1126a4c..a17070d560f 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -194,6 +194,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fa7935a0ed3..def4fe20d1e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -570,6 +571,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -611,7 +642,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1256,6 +1292,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c0bec014154..6d4de77037c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, and that infrastructure is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed, because the snapshot is replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* point at the active snapshot, since pushing may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 07bae342e25..0d262a4188d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db90..b490da0eeee 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cdabf780244..210fc88433f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3209,7 +3226,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3272,12 +3290,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT on the
+ * scan, which causes a new snapshot to be set as active every so often.
+ * The reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 47c509ceb3e..9eddeb93338 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1700,23 +1700,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible in one snapshot or in a
+	 * series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4079,9 +4073,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4096,7 +4087,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7b1a8a0a9f1..7d23540bf5c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6781,6 +6782,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6836,6 +6838,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6893,6 +6900,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..f5bb04d5bd1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 131c050c15f..5393b30c57e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset the scan and catalog snapshots every so often?  If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed.
+	 *
+	 * The snapshot is not popped at the end of the scan.  The goal of this
+	 * mode is to let the xmin horizon advance.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -936,7 +948,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -944,6 +957,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot must not be registered, so that xmin can advance. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1776,6 +1798,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index built concurrently without parallelism,
+ * SO_RESET_SNAPSHOT is applied to the scan.  The snapshot is then changed
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
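
Editorial note on the reset cadence in the patch above: heap_fetch_next_buffer() resets the snapshot whenever the scan has SO_RESET_SNAPSHOT set and the current block number is a multiple of SO_RESET_SNAPSHOT_EACH_N_PAGE. Block 0 satisfies that test, which would explain why the 200-row regression table (assumed here to fit in a single heap page) still reports heap_reset_scan_snapshot_effective. The standalone sketch below only mirrors that arithmetic. The two constants are copied from the patch; reset_due() and count_resets() are hypothetical helper names that do not exist in PostgreSQL, and the sketch assumes a plain forward scan starting at block 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the patch; everything else here is hypothetical. */
#define SO_RESET_SNAPSHOT				(1u << 11)
#define SO_RESET_SNAPSHOT_EACH_N_PAGE	4096u

/* Same condition as the one the patch adds to heap_fetch_next_buffer(). */
static bool
reset_due(uint32_t flags, uint32_t blkno)
{
	return (flags & SO_RESET_SNAPSHOT) != 0 &&
		(blkno % SO_RESET_SNAPSHOT_EACH_N_PAGE) == 0;
}

/* Count the resets of a forward scan over blocks [0, nblocks). */
static uint32_t
count_resets(uint32_t flags, uint32_t nblocks)
{
	uint32_t	resets = 0;

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
	{
		if (reset_due(flags, blkno))
			resets++;
	}

	/* With the flag set, this equals ceil(nblocks / N). */
	assert((flags & SO_RESET_SNAPSHOT) == 0 ||
		   resets == (nblocks + SO_RESET_SNAPSHOT_EACH_N_PAGE - 1) /
		   SO_RESET_SNAPSHOT_EACH_N_PAGE);
	return resets;
}
```

Under these assumptions a one-page table sees exactly one reset (at block 0), and a 4097-page table sees two (at blocks 0 and 4096). In the patch itself the "effective" injection point additionally fires only when no xid is assigned to the backend, which this sketch does not model.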



  [text/x-patch] v15-0004-Allow-snapshot-resets-during-parallel-concurrent.patch (34.1K, 7-v15-0004-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From 225b911a2d733030a42e68152ad47f86db9a715b Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v15 04/12] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with the scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
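Not part of the patch: a minimal standalone sketch of the rule that the new
wait_for_snapshot argument adds to WaitForParallelWorkersToAttach. The names
SketchWorker and sketch_count_attached are hypothetical and exist only for
illustration; the real code works on ParallelContext and on flags kept in
dynamic shared memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the per-worker state the leader inspects.
 * These fields are illustrative, not the PostgreSQL structs.
 */
typedef struct SketchWorker
{
	bool		queue_attached;		/* worker attached to its error queue */
	bool		snapshot_restored;	/* worker has set its "restored" flag */
} SketchWorker;

/*
 * Count the workers the leader may treat as attached.  Without
 * wait_for_snapshot, attaching to the error queue is enough.  With it,
 * a worker counts only after it has also restored its snapshot.
 */
int
sketch_count_attached(const SketchWorker *workers, size_t nworkers,
					  bool wait_for_snapshot)
{
	int			nattached = 0;

	assert(workers != NULL || nworkers == 0);

	for (size_t i = 0; i < nworkers; i++)
	{
		if (!workers[i].queue_attached)
			continue;
		if (!wait_for_snapshot || workers[i].snapshot_restored)
			nattached++;
	}
	return nattached;
}
```

In this sketch the leader would keep polling until the count reaches the
number of launched workers, which is roughly what the wait loop in the patch
does through known_attached_workers.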
 src/backend/access/brin/brin.c                | 49 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 196 insertions(+), 67 deletions(-)
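
For orientation only (this is not patch code): a small sketch of how a scan
participant could choose its starting snapshot from the two flags kept in the
parallel scan descriptor, phs_snapshot_any and the new phs_reset_snapshot.
SketchParallelScan, SketchSnapshotSource and sketch_snapshot_source are
hypothetical names; the actual decision is made in table_beginscan_parallel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two descriptor flags; not the real struct. */
typedef struct SketchParallelScan
{
	bool		snapshot_any;		/* scan uses SnapshotAny */
	bool		reset_snapshot;		/* scan resets its snapshot periodically */
} SketchParallelScan;

typedef enum SketchSnapshotSource
{
	SKETCH_ACTIVE_WITH_RESET,		/* start from the active snapshot */
	SKETCH_RESTORE_SERIALIZED,		/* restore the serialized MVCC snapshot */
	SKETCH_SNAPSHOT_ANY				/* use SnapshotAny */
} SketchSnapshotSource;

/*
 * The reset flag is checked first: in that mode no snapshot was serialized,
 * so the participant must start from its own active snapshot.
 */
SketchSnapshotSource
sketch_snapshot_source(const SketchParallelScan *pscan)
{
	assert(pscan != NULL);

	if (pscan->reset_snapshot)
		return SKETCH_ACTIVE_WITH_RESET;
	if (!pscan->snapshot_any)
		return SKETCH_RESTORE_SERIALIZED;
	return SKETCH_SNAPSHOT_ANY;
}
```

The ordering matters in this sketch because both flags are false in reset
mode only if the reset check is skipped, in which case the participant would
try to restore a snapshot that was never written.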

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index f1dba9e8185..d8317787251 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1248,7 +1247,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1263,6 +1261,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2363,7 +2362,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2394,25 +2392,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2452,8 +2450,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2478,7 +2474,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2524,7 +2521,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2540,6 +2536,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so wait
+	 * until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2548,7 +2551,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2571,9 +2575,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2773,14 +2774,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2802,6 +2803,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2942,6 +2944,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6d4de77037c..2a617a05f8c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b490da0eeee..810f80fc8e6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * For a concurrent non-unique build, snapshots are reset periodically,
+	 * so wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..277c79dd554 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, set SO_RESET_SNAPSHOT
+	 * on the scan and use the current active snapshot as the initial
+	 * state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 4ab5df92133..ec3c80fef27 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report a
+		 * restored snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * If wait_for_snapshot is true, a worker is counted as attached only once it
+ * has also restored its snapshot. This is needed with periodic snapshot resets
+ * so that all workers hold a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1495,6 +1533,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 210fc88433f..dacff9605ad 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 6f9e991eeae..bc639964ada 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3d018c3a1e8..4cd536e988c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 8811618acb7..f5cae39c85f 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index dc6e0184284..8529b808aed 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5393b30c57e..313394d92c6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1181,7 +1181,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1799,9 +1800,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. Snapshots are then changed on the fly so that the xmin horizon
+ * can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [text/x-patch] v15-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.0K, 8-v15-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
  download | inline diff:
From 743b00180b5b44f57793a4541f8df6481054b433 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v15 06/12] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
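Note (not part of the patch): the STIR description above boils down to an
append-only page of fixed-size TIDs. The standalone sketch below illustrates
that idea only. Every `Sketch*` / `SKETCH_*` name and the three size constants
are assumptions made for illustration; they are not the PostgreSQL page layout,
and the real code is `StirPageAddItem()` and `StirPageGetFreeSpace()` in the
diff that follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes, chosen only for this sketch. */
#define SKETCH_PAGE_SIZE   8192	/* assumed block size */
#define SKETCH_HEADER_SIZE 24	/* assumed aligned page header size */
#define SKETCH_OPAQUE_SIZE 8	/* assumed special-space size */

/* Simplified stand-in for a heap TID (block number plus offset). */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/* How many fixed-size entries fit between header and special space. */
#define SKETCH_PAGE_CAPACITY \
	((SKETCH_PAGE_SIZE - SKETCH_HEADER_SIZE - SKETCH_OPAQUE_SIZE) \
	 / sizeof(SketchTid))

typedef struct SketchPage
{
	uint16_t	maxoff;			/* number of entries stored so far */
	SketchTid	items[SKETCH_PAGE_CAPACITY];
} SketchPage;

/* Bytes still available for new entries on this page. */
static size_t
sketch_page_free_space(const SketchPage *page)
{
	return (SKETCH_PAGE_CAPACITY - page->maxoff) * sizeof(SketchTid);
}

/*
 * Append one TID at the end of the page.  Returns false when the page is
 * full, in which case the caller would move on to a freshly added page.
 */
static bool
sketch_page_add_item(SketchPage *page, const SketchTid *tid)
{
	if (sketch_page_free_space(page) < sizeof(SketchTid))
		return false;

	memcpy(&page->items[page->maxoff], tid, sizeof(SketchTid));
	page->maxoff++;
	return true;
}
```

A caller zero-initializes a `SketchPage`, appends until
`sketch_page_add_item()` returns false, and then switches to a new page. The
patch does the same, tracking the current last page in the metapage.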
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1af18a78a2b..63d6d1738bb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3064,6 +3064,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3115,6 +3116,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it is an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed; clean up the context before exit */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);	/* not held during insert */
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at this moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 53be1269aff..3e2752c0285 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3406,6 +3406,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cd75954951b..ab7c678bf9a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 007612563ca..a50afeae674 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -828,6 +828,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1be8739573f..44f8a0d5606 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 43445cdcc6c..26ddd5ec577 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9e803d610d7..3f17fba1b04 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a323fa98bbb..8c0ad96e02c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -182,12 +182,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,6 +217,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index f9db4032e1f..4e3ddd6810e 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5130,7 +5130,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5144,7 +5145,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5169,9 +5171,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5180,12 +5182,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5194,7 +5197,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
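
A note on the StirPageGetFreeSpace macro in the header above: the arithmetic can be checked with a small standalone sketch. The sizes below are assumptions for a default build (8 kB blocks, 24-byte page header, 8-byte maximum alignment, 6-byte ItemPointerData) and are not taken from the patch; all sketch_* and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a default build; none of these come from the patch. */
#define SKETCH_BLCKSZ			8192	/* default block size */
#define SKETCH_PAGE_HEADER_SIZE	24		/* assumed SizeOfPageHeaderData */
#define SKETCH_MAXIMUM_ALIGNOF	8
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_MAXIMUM_ALIGNOF - 1)) & \
	 ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1)))
#define SKETCH_STIR_TUPLE_SIZE	6		/* assumed sizeof(ItemPointerData) */
#define SKETCH_STIR_OPAQUE_SIZE	8		/* four 16-bit fields */

/* Free bytes left on a data page that already holds maxoff tuples. */
static size_t
sketch_stir_free_space(size_t maxoff)
{
	size_t		used = SKETCH_MAXALIGN(SKETCH_PAGE_HEADER_SIZE)
		+ maxoff * SKETCH_STIR_TUPLE_SIZE
		+ SKETCH_MAXALIGN(SKETCH_STIR_OPAQUE_SIZE);

	assert(used <= SKETCH_BLCKSZ);
	return SKETCH_BLCKSZ - used;
}

/* How many fixed-size tuples fit on an empty data page. */
static size_t
sketch_stir_tuples_per_page(void)
{
	return sketch_stir_free_space(0) / SKETCH_STIR_TUPLE_SIZE;
}
```

Under these assumptions an empty page has 8160 free bytes, which is room for 1360 heap pointers.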



  [text/x-patch] v15-0007-tuplestore-add-support-for-storing-Datum-values.patch (17.3K, 9-v15-0007-tuplestore-add-support-for-storing-Datum-values.patch)
  download | inline diff:
From aa02347cdde0bd767ad858998992ece1ff57bba9 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v15 07/12] tuplestore: add support for storing Datum values

Add ability to store and retrieve individual Datum values in tuplestore, optimizing storage based on type:

- Fixed-length: stores raw bytes without length prefix
- Variable-length: includes length prefix/suffix
- By-value types handled inline

This extends tuplestore beyond handling tuples; it is planned to be used in the next patch.
---
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)
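
For readers of the commit message above, here is a minimal standalone sketch of the storage layout it describes: fixed-length values are written as raw bytes, variable-length values get a length prefix and, when backward scans are possible, a length suffix. It writes to a memory buffer instead of a BufFile. The sketch_* names are hypothetical, and the length word here counts only the payload, which is an assumption of this sketch and not necessarily what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Append one value to buf.  typlen > 0 means fixed-length, typlen < 0
 * means variable-length.  Returns the number of bytes written.
 */
static size_t
sketch_encode(unsigned char *buf, const void *data, unsigned int datalen,
			  int typlen, int backward)
{
	size_t		off = 0;

	if (typlen > 0)
	{
		/* Fixed-length: raw bytes only, no length words at all. */
		assert(datalen == (unsigned int) typlen);
		memcpy(buf, data, datalen);
		return datalen;
	}

	/* Variable-length: leading length word, then the payload. */
	memcpy(buf + off, &datalen, sizeof(datalen));
	off += sizeof(datalen);
	memcpy(buf + off, data, datalen);
	off += datalen;

	/* The trailing length word is needed only for backward scans. */
	if (backward)
	{
		memcpy(buf + off, &datalen, sizeof(datalen));
		off += sizeof(datalen);
	}
	return off;
}

/* Return the payload length and point *payload at the data inside buf. */
static unsigned int
sketch_decode(const unsigned char *buf, int typlen,
			  const unsigned char **payload)
{
	unsigned int len;

	if (typlen > 0)
	{
		/* The length is known from the type, nothing to read. */
		*payload = buf;
		return (unsigned int) typlen;
	}
	memcpy(&len, buf, sizeof(len));
	*payload = buf + sizeof(len);
	return len;
}
```

The point of the layout is that a store of fixed-size values such as TIDs pays no per-item overhead for length words.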

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index d61b601053c..03434f3ea49 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape. It supplies the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both "unsigned int" words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index ed7c454f44e..1f431863387 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [text/x-patch] v15-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (109.9K, 10-v15-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
  download | inline diff:
From b9edf5be44419f0a4caa755b11e63f2e74ebf7d3 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v15 08/12] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
 doc/src/sgml/monitoring.sgml                  |  26 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/README.HOT            |  15 +-
 src/backend/access/heap/heapam_handler.c      | 591 ++++++++++++------
 src/backend/catalog/index.c                   | 312 +++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 ++++++++---
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  31 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 +
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 21 files changed, 1193 insertions(+), 402 deletions(-)
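
As a rough illustration of the "merging" step mentioned in the commit message above, the sketch below takes two sorted lists of heap TIDs, one for entries already in the target index and one collected by the auxiliary index, and returns the TIDs that still have to be inserted. This is only a sketch of the general idea under that assumption; the SketchTid type and the sketch_* functions are hypothetical and do not mirror the patch's actual data structures or visibility checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID (block number, offset). */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

static int
sketch_tid_cmp(SketchTid a, SketchTid b)
{
	if (a.block != b.block)
		return a.block < b.block ? -1 : 1;
	if (a.offset != b.offset)
		return a.offset < b.offset ? -1 : 1;
	return 0;
}

/*
 * Both inputs must be sorted ascending.  Writes to out every distinct TID
 * that is in aux_tids but not in main_tids, and returns how many were
 * written.  out must have room for naux entries.
 */
static size_t
sketch_missing_tids(const SketchTid *main_tids, size_t nmain,
					const SketchTid *aux_tids, size_t naux,
					SketchTid *out)
{
	size_t		i = 0;
	size_t		n = 0;

	for (size_t j = 0; j < naux; j++)
	{
		int			c;

		/* The auxiliary list is expected to be sorted. */
		assert(j == 0 || sketch_tid_cmp(aux_tids[j - 1], aux_tids[j]) <= 0);

		/* Skip target-index entries that sort before the current TID. */
		while (i < nmain && sketch_tid_cmp(main_tids[i], aux_tids[j]) < 0)
			i++;

		c = (i < nmain) ? sketch_tid_cmp(main_tids[i], aux_tids[j]) : 1;

		/* Emit TIDs absent from the target index, dropping duplicates. */
		if (c != 0 &&
			(n == 0 || sketch_tid_cmp(out[n - 1], aux_tids[j]) != 0))
			out[n++] = aux_tids[j];
	}
	return n;
}
```

Because both lists are walked once in order, the cost is linear in their combined length and does not depend on the size of the table.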

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 71c4f96d054..aa16e21e87a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6288,6 +6288,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, so that all later transactions insert new tuples
+       into the auxiliary index.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6328,13 +6340,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6351,8 +6362,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..6a05620bd67 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,11 +368,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones that are visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 76837203601..7ebab0922a9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,450 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Now take the snapshot that will be used to filter candidate
+	 * tuples.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is returned for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
+
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3e2752c0285..deb48e97dd4 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +748,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +760,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +797,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1407,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1462,7 +1473,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1484,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux index are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2468,7 +2629,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2528,7 +2690,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3279,12 +3442,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that we build the auxiliary index. It is a fast operation without
+ * any actual table scan. As a result, we have an empty STIR index. We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate. From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3294,18 +3466,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it. At the same time we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3313,12 +3488,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that we need to do bulkdelete
+ * in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3336,22 +3513,27 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, some actions to concurrently drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index and
+	 * 10% for sorting the auxiliary index.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3384,12 +3566,16 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
 
 	/* mark build is concurrent just for consistency */
@@ -3408,15 +3594,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3439,27 +3640,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3468,8 +3675,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3528,6 +3739,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3799,6 +4015,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4041,6 +4264,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4066,6 +4290,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eff0990957e..0fedf74a12d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1278,16 +1278,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5ebc50831be..63ed47cfb25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,10 +588,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -833,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1257,7 +1272,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1599,6 +1615,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1627,11 +1653,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1641,7 +1667,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1680,7 +1706,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1692,15 +1718,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index. The index will be created empty,
+	 * without any actual heap scan, but marked as "ready-for-inserts". The goal
+	 * of that index is to accumulate new tuples while the main index is built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions left that see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1728,43 +1778,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1787,12 +1825,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1817,6 +1855,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index, as it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3537,6 +3622,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3642,8 +3728,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3695,8 +3788,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3757,6 +3857,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3860,15 +3967,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3919,6 +4029,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3932,12 +4047,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3946,6 +4066,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3964,10 +4085,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4048,13 +4173,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast, since it creates an empty index without a heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4097,24 +4264,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts". So,
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4129,13 +4324,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4147,16 +4335,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4176,7 +4356,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4266,14 +4446,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4298,6 +4478,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4311,11 +4513,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4335,6 +4537,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index a50afeae674..9990733ea8e 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -787,7 +787,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -803,6 +803,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -828,7 +829,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b1920999f12..1b2ef8f8002 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -715,11 +715,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1863,22 +1863,25 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sortstates
+ * in both state and auxstate.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..01f85e57ea2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8c0ad96e02c..4826c1a5538 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -187,8 +187,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index bd5f002cf20..34362e3d875 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3049,6 +3050,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3061,8 +3063,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3090,6 +3094,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5baba8d39ff..436c736e64c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,14 +2020,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index be570da08a0..fcff5d19998 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1250,10 +1251,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1265,6 +1268,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
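
The "index validation: merging indexes" phase in the progress view above corresponds to what the validate_index() comment in index.c calls a "merge join": the sorted TID lists of the auxiliary and target indexes are compared to find TIDs present in the auxiliary index but missing from the target. Below is a minimal standalone sketch of that idea, not the patch's heapam_index_validate_tuplesort_difference(). The names encode_tid and tid_difference are hypothetical, plain arrays stand in for tuplesorts, and both inputs are assumed to be sorted ascending and free of duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack (block, offset) into one int64 so that integer
 * order matches TID order.  The patch stores TIDs as int8 datums; this exact
 * layout is an assumption of the sketch.
 */
static int64_t
encode_tid(uint32_t block, uint16_t offset)
{
	return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Hypothetical helper: copy into out[] every TID of aux[] that does not occur
 * in main_tids[].  Both inputs must be sorted ascending; out[] must have room
 * for naux entries.  Returns the number of TIDs written.
 */
static size_t
tid_difference(const int64_t *main_tids, size_t nmain,
			   const int64_t *aux, size_t naux,
			   int64_t *out)
{
	size_t		i = 0;
	size_t		n = 0;

	assert(out != NULL || naux == 0);

	for (size_t j = 0; j < naux; j++)
	{
		/* advance the target-index cursor up to the current auxiliary TID */
		while (i < nmain && main_tids[i] < aux[j])
			i++;

		/* no match in the target index: this TID still has to be checked */
		if (i >= nmain || main_tids[i] != aux[j])
			out[n++] = aux[j];
	}
	return n;
}
```

Because both lists are consumed in a single forward pass, the cost is linear in the combined list length, which is why the inputs are sorted first.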



  [text/x-patch] v15-0009-Concurrently-built-index-validation-uses-fresh-s.patch (14.1K, 11-v15-0009-Concurrently-built-index-validation-uses-fresh-s.patch)
  download | inline diff:
From 9cbf0b69a7b97e222335f6d2265aa13adf9cab29 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 17:21:29 +0100
Subject: [PATCH v15 09/12] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach held a single reference snapshot for the whole validation scan. If the scan took a long time, that snapshot pinned the backend's xmin for its entire duration, holding back the horizon and preventing VACUUM from removing dead tuples on other tables.

To address this, the validation process now periodically replaces the snapshot with a newer one. This lets the backend's xmin advance during the scan; tuples deleted just before the latest snapshot are still handled by waiting out transactions that hold older snapshots.
---
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/heapam_handler.c | 77 +++++++++++++++---------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 14 +++--
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 +++++
 8 files changed, 97 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6a05620bd67..64c633e0398 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -495,10 +495,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7ebab0922a9..59ffc9cf4f7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1791,8 +1791,8 @@ heapam_index_build_range_scan(Relation heapRelation,
  */
 static int
 heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
-										   Tuplesortstate  *aux,
-										   Tuplestorestate *store)
+									  Tuplesortstate  *aux,
+									  Tuplestorestate *store)
 {
 	int				num = 0;
 	/* state variables for the merge */
@@ -2048,7 +2048,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2059,9 +2060,35 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
-	 * Now take the snapshot that will be used by to filter candidate
-	 * tuples.
+	 * sanity checks
+	 */
+	Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+															  auxState->tuplesort,
+															  tuples_for_check);
+
+	/* It is our responsibility to close tuple sort as fast as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
 	 *
 	 * Beware!  There might still be snapshots in use that treat some transaction
 	 * as in-progress that our temporary snapshot treats as committed.
@@ -2077,33 +2104,10 @@ heapam_index_validate_scan(Relation heapRelation,
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
 	limitXmin = snapshot->xmin;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
-	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
-
-	/*
-	 * sanity checks
-	 */
-	Assert(OidIsValid(indexRelation->rd_rel->relam));
-
-	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
-														 auxState->tuplesort,
-														 tuples_for_check);
-
-	/* It is our responsibility to sloe tuple sort as fast as we can */
-	tuplesort_end(state->tuplesort);
-	tuplesort_end(auxState->tuplesort);
-
-	state->tuplesort = auxState->tuplesort = NULL;
-
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2140,6 +2144,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2194,6 +2199,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+
+		if (page_read_counter % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8b236c8ccd6..62e975016ad 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index eeddacd0d52..4130e49dd98 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -190,14 +190,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -811,7 +813,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -925,6 +926,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -965,6 +970,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index deb48e97dd4..cca6165339b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3480,8 +3480,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3494,7 +3495,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3577,6 +3578,7 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3612,6 +3614,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3641,9 +3646,6 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	}
 	tuplesort_performsort(state.tuplesort);
 	tuplesort_performsort(auxState.tuplesort);
-
-	PopActiveSnapshot();
-	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 63ed47cfb25..7805e43178f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4349,7 +4349,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0
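
The validation loop in this patch combines two small mechanisms: a page counter that triggers a snapshot swap every SO_RESET_SNAPSHOT_EACH_N_PAGE pages, and TransactionIdNewer() to keep limitXmin from moving backwards. The following is a standalone sketch of both, under stated assumptions. The typedef and the xid_* helpers are stand-ins written for this example (they mimic the modulo-2^32 comparison that TransactionIdFollows performs), and the reset interval is passed as a parameter because the real constant's value is not shown in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for PostgreSQL's type */

#define INVALID_XID			((TransactionId) 0)
#define FIRST_NORMAL_XID	((TransactionId) 3)

static bool
xid_is_valid(TransactionId xid)
{
	return xid != INVALID_XID;
}

/* Wraparound-aware test: is "a" logically later than "b"? */
static bool
xid_follows(TransactionId a, TransactionId b)
{
	/* special (non-normal) XIDs are compared as plain integers */
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Same shape as the TransactionIdNewer() added to transam.h */
static TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}

/*
 * Count how many snapshot swaps the loop performs while reading "pages"
 * heap pages.  The counter starts at 1 and is incremented before the
 * check, mirroring page_read_counter in heapam_index_validate_scan().
 */
static int
count_snapshot_resets(int pages, int reset_every)
{
	int			page_read_counter = 1;
	int			resets = 0;

	assert(reset_every > 0);

	for (int i = 0; i < pages; i++)
	{
		page_read_counter++;
		if (page_read_counter % reset_every == 0)
			resets++;			/* the patch swaps the snapshot here */
	}
	return resets;
}
```

The signed 32-bit difference is what makes the comparison safe across XID wraparound: an XID just past the wrap point still counts as newer than one just before it.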



  [text/x-patch] v15-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 12-v15-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From 9cb4a0884b596723007075cf6f8fd986c3fbe614 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v15 11/12] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, because of an index constraint violation), such indexes are left behind as junk indexes that serve no purpose. This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is simply to drop that index with
+    <literal>DROP INDEX</literal>.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 64c633e0398..c6db5d57167 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is simply to drop that index with <literal>DROP INDEX</literal>.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cca6165339b..19201d26211 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -687,6 +687,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -733,6 +735,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -775,6 +778,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be given together or not at all */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1176,6 +1181,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1458,6 +1472,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1608,6 +1623,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3827,6 +3843,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3883,6 +3900,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4171,7 +4201,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4260,13 +4291,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4292,18 +4340,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index held by a relation
+		 * with relkind RELKIND_INDEX is the auxiliary index we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 1467cd89930..4466f7e6261 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1254,7 +1254,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3588,6 +3588,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3936,6 +3937,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3943,6 +3945,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4005,12 +4008,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Look up any auxiliary index of the reindexed index, so we can drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4020,6 +4028,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4040,10 +4049,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4200,7 +4217,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * the junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4219,6 +4237,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk auxiliary index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4401,6 +4422,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4446,6 +4469,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 9d8754be7e5..0a1397c7005 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1502,6 +1502,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1562,9 +1564,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1616,6 +1629,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop is required to be the first transaction. If
+		 * there is a junk auxiliary index, we want to drop it too (also in a
+		 * concurrent way). In that case, perform a silent internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to do
+		 * this in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1644,12 +1685,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 01f85e57ea2..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 34362e3d875..8aa6815b37c 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3117,20 +3117,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index fcff5d19998..5e5cf23d97d 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1279,11 +1279,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0
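
The lookup done by `get_auxiliary_index` in the patch above amounts to a filter over the dependency rows that reference the main index. Below is a minimal standalone sketch of that filter, not the patch's code: `DepRow`, `find_auxiliary_index` and the in-memory array are hypothetical stand-ins for the `pg_depend` scan, and `relkind` is stored in the row here although the patch looks it up via `get_rel_relkind`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real code scans pg_depend via systable_getnext(). */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef struct DepRow
{
	Oid		objid;		/* dependent object */
	Oid		refobjid;	/* referenced object (the main index) */
	char	deptype;	/* 'a' = auto, 'n' = normal, as in pg_depend.deptype */
	char	relkind;	/* 'i' = index; a separate lookup in the real code */
} DepRow;

/*
 * Return the first object that has an AUTO dependency on indexId and is
 * itself an index, or InvalidOid if there is none.
 */
static Oid
find_auxiliary_index(const DepRow *rows, size_t nrows, Oid indexId)
{
	assert(rows != NULL || nrows == 0);

	for (size_t i = 0; i < nrows; i++)
	{
		if (rows[i].refobjid == indexId &&
			rows[i].deptype == 'a' &&
			rows[i].relkind == 'i')
			return rows[i].objid;
	}
	return InvalidOid;
}
```

As in the patch, the first matching row wins, so the sketch relies on the same assumption that at most one index has an AUTO dependency on a given main index.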



  [text/x-patch] v15-0012-Updates-index-insert-and-value-computation-logic.patch (2.2K, 13-v15-0012-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From 22fcc557e73320e59243d9ae7dec863c6e283ece Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v15 12/12] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since the values are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)
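
The two changes in this patch can be pictured with the standalone sketch below. It is only an illustration under simplifying assumptions: `SketchIndexInfo`, `form_datum_sketch`, `index_unchanged_hint` and `expensive_unchanged_check` are hypothetical names, and the counter exists only to make the short-circuit observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;	/* stand-in for the PostgreSQL Datum type */

/* Hypothetical reduction of IndexInfo to the two fields used here. */
typedef struct SketchIndexInfo
{
	int		numAttrs;
	bool	auxiliary;
} SketchIndexInfo;

/* Counts calls, standing in for the cost of index_unchanged_by_update(). */
static int	expensive_calls = 0;

static bool
expensive_unchanged_check(void)
{
	expensive_calls++;
	return true;
}

/*
 * For an auxiliary index, fill every column with NULL and report that the
 * value computation was skipped; otherwise the caller computes real values.
 */
static bool
form_datum_sketch(const SketchIndexInfo *info, Datum *values, bool *isnull)
{
	assert(info->numAttrs >= 0);

	if (info->auxiliary)
	{
		for (int i = 0; i < info->numAttrs; i++)
		{
			values[i] = (Datum) 0;
			isnull[i] = true;
		}
		return true;
	}
	return false;
}

/* The && chain never reaches the expensive check for an auxiliary index. */
static bool
index_unchanged_hint(bool update, const SketchIndexInfo *info)
{
	return update && !info->auxiliary && expensive_unchanged_check();
}
```

The ordering of the `&&` operands matters: the auxiliary test comes before the expensive check, so that check is never evaluated for auxiliary indexes.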

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 19201d26211..ff8a8a2731e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2932,6 +2932,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f2a74b76465..eef1b35e68c 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -441,11 +441,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [text/x-patch] v15-0010-Remove-PROC_IN_SAFE_IC-optimization.patch (20.7K, 14-v15-0010-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From c92be47661e551f69330bfbc85407d0f49897ed0 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v15 10/12] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 8 files changed, 11 insertions(+), 233 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql
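
The effect of dropping `PROC_IN_SAFE_IC` from the exclusion mask passed to `GetCurrentVirtualXIDs()` can be shown with a small standalone sketch. It is a deliberate simplification: the real function also checks xmin, database and other conditions. The `SK_` flag values and `must_wait_for` are illustrative names, not the definitions from `proc.h`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; assumed values, used only for this sketch. */
#define SK_PROC_IS_AUTOVACUUM	0x01
#define SK_PROC_IN_VACUUM		0x02
#define SK_PROC_IN_SAFE_IC		0x04

/* Exclusion mask as it was before this patch and as it is after it. */
#define SK_MASK_BEFORE \
	(SK_PROC_IS_AUTOVACUUM | SK_PROC_IN_VACUUM | SK_PROC_IN_SAFE_IC)
#define SK_MASK_AFTER \
	(SK_PROC_IS_AUTOVACUUM | SK_PROC_IN_VACUUM)

/*
 * A backend is waited for only if none of its status flags is in the
 * exclusion mask.
 */
static bool
must_wait_for(uint8_t statusFlags, uint8_t excludeMask)
{
	return (statusFlags & excludeMask) == 0;
}
```

With the old mask, a backend running a "safe" concurrent index build was skipped. With the new mask it is waited for like any other backend, while vacuum and autovacuum are still skipped.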

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d8317787251..0f839395c78 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2889,11 +2889,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * There is no status flag that can be set for the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 62e975016ad..1eb4299826e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * There is no status flag that can be set for the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7805e43178f..1467cd89930 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1187,10 +1180,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1677,10 +1666,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1735,9 +1720,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1767,10 +1749,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1796,9 +1774,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1815,9 +1791,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1858,10 +1831,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1882,10 +1851,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3625,7 +3590,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3997,17 +3961,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4067,7 +4020,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4160,11 +4112,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4195,10 +4142,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4207,11 +4150,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4236,10 +4174,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4259,11 +4193,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4284,10 +4213,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4320,10 +4245,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4351,9 +4272,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4375,13 +4293,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4437,12 +4348,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4504,12 +4409,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4769,36 +4668,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 20777f7d5ae..4bd24bc02d4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0


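A note for readers of the revert above: the following is a minimal standalone sketch of what dropping PROC_IN_SAFE_IC from the exclusion mask means for WaitForOlderSnapshots. The flag values are copied from the proc.h hunk; `should_wait_for` is a hypothetical helper written for illustration, not a PostgreSQL function, and the real GetCurrentVirtualXIDs checks more than the status flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the proc.h hunk above. */
#define PROC_IS_AUTOVACUUM	0x01
#define PROC_IN_VACUUM		0x02
#define PROC_IN_SAFE_IC		0x04	/* removed by this patch */

/*
 * Hypothetical helper: a backend is skipped (not waited for) when any of
 * its status flags is present in the caller's exclusion mask.
 */
static bool
should_wait_for(uint8_t status_flags, uint8_t exclude_mask)
{
	return (status_flags & exclude_mask) == 0;
}
```

With the old mask a backend flagged PROC_IN_SAFE_IC was skipped; with the new mask (PROC_IS_AUTOVACUUM | PROC_IN_VACUUM) the same backend is waited for like any other transaction.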


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-03-07 22:58  Michail Nikolaev <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Michail Nikolaev @ 2025-03-07 22:58 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased; the new parallel GIN builds are now supported.

Best regards,
Mikhail.



Attachments:

  [application/octet-stream] v16-0009-Concurrently-built-index-validation-uses-fresh-s.patch (14.1K, 3-v16-0009-Concurrently-built-index-validation-uses-fresh-s.patch)
  download | inline diff:
From e579bfb2df7bfa0c87d721e05c68c8013915d441 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 17:21:29 +0100
Subject: [PATCH v16 09/12] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach of using a single reference snapshot could lead to issues with xmin propagation. Specifically, if the index build took a long time, the reference snapshot's xmin could become outdated, causing the index to miss tuples that were deleted by transactions that committed after the reference snapshot was taken.

To address this, the validation process now periodically replaces the snapshot with a newer one. This ensures that the index's xmin is kept up-to-date and that all relevant tuples are included in the index.
---
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/heapam_handler.c | 77 +++++++++++++++---------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 14 +++--
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 +++++
 8 files changed, 97 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6a05620bd67..64c633e0398 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -495,10 +495,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e3e8a678c9..a596fc9920a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1791,8 +1791,8 @@ heapam_index_build_range_scan(Relation heapRelation,
  */
 static int
 heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
-										   Tuplesortstate  *aux,
-										   Tuplestorestate *store)
+									  Tuplesortstate  *aux,
+									  Tuplestorestate *store)
 {
 	int				num = 0;
 	/* state variables for the merge */
@@ -2048,7 +2048,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2059,9 +2060,35 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
-	 * Now take the snapshot that will be used by to filter candidate
-	 * tuples.
+	 * sanity checks
+	 */
+	Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+															  auxState->tuplesort,
+															  tuples_for_check);
+
+	/* It is our responsibility to close tuple sort as fast as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
 	 *
 	 * Beware!  There might still be snapshots in use that treat some transaction
 	 * as in-progress that our temporary snapshot treats as committed.
@@ -2077,33 +2104,10 @@ heapam_index_validate_scan(Relation heapRelation,
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
 	limitXmin = snapshot->xmin;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
-	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
-
-	/*
-	 * sanity checks
-	 */
-	Assert(OidIsValid(indexRelation->rd_rel->relam));
-
-	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
-														 auxState->tuplesort,
-														 tuples_for_check);
-
-	/* It is our responsibility to sloe tuple sort as fast as we can */
-	tuplesort_end(state->tuplesort);
-	tuplesort_end(auxState->tuplesort);
-
-	state->tuplesort = auxState->tuplesort = NULL;
-
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2140,6 +2144,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2194,6 +2199,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+
+		if (page_read_counter % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f7914ebb3d0..bb4e0fbb675 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index eeddacd0d52..4130e49dd98 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -190,14 +190,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -811,7 +813,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -925,6 +926,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -965,6 +970,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8df0b472e88..39d2f474865 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3485,8 +3485,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3499,7 +3500,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3582,6 +3583,7 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3617,6 +3619,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3646,9 +3651,6 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	}
 	tuplesort_performsort(state.tuplesort);
 	tuplesort_performsort(auxState.tuplesort);
-
-	PopActiveSnapshot();
-	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 51db7f23378..85f83a97a1f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4349,7 +4349,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0

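For anyone reading the transam.h hunk without the tree at hand, here is a minimal standalone sketch of what TransactionIdNewer computes, including the wraparound case. The lowercase names (`xid_follows`, `xid_newer`, and so on) are stand-ins written for this note; `xid_follows` is a simplified copy of the usual modulo-2^32 comparison and is an assumption about the surrounding code, not something taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

static bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

static bool
xid_is_normal(TransactionId xid)
{
	return xid >= FirstNormalTransactionId;
}

/* Does a logically follow b?  Normal XIDs are compared modulo 2^32. */
static bool
xid_follows(TransactionId a, TransactionId b)
{
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Same shape as the patch's TransactionIdNewer: invalid inputs lose. */
static TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}
```

In the validation loop this keeps limitXmin from moving backwards when the snapshot is replaced: an invalid value never overrides a valid one, and a small XID issued after wraparound still counts as newer than a large pre-wraparound one.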

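The refresh cadence in heapam_index_validate_scan can be modelled in isolation as below. This is only a sketch: `RESET_EACH_N_PAGES` is a hypothetical stand-in for SO_RESET_SNAPSHOT_EACH_N_PAGE (its real value is not shown in this patch), `count_snapshot_refreshes` is an illustrative helper, and the loop ignores everything the real scan does with the tuples.

```c
#include <stddef.h>

/* Hypothetical stand-in for SO_RESET_SNAPSHOT_EACH_N_PAGE. */
#define RESET_EACH_N_PAGES 4

/* Count how many times the snapshot would be replaced for a given scan. */
static size_t
count_snapshot_refreshes(size_t pages_read)
{
	size_t		page_read_counter = 1;	/* starts at 1, as in the patch */
	size_t		refreshes = 0;

	for (size_t page = 0; page < pages_read; page++)
	{
		page_read_counter++;

		/* ... tuples on this page would be checked here ... */

		if (page_read_counter % RESET_EACH_N_PAGES == 0)
			refreshes++;	/* the patch swaps in GetLatestSnapshot() here */
	}

	return refreshes;
}
```

The point of the model is that the backend's xmin is held for at most a bounded number of pages, rather than for the whole validation scan.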

  [application/octet-stream] v16-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (109.9K, 4-v16-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
  download | inline diff:
From 31b2324ce9aeff3db57e25b53f38d1b6207d2af5 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v16 08/12] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
 doc/src/sgml/monitoring.sgml                  |  26 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/README.HOT            |  15 +-
 src/backend/access/heap/heapam_handler.c      | 591 ++++++++++++------
 src/backend/catalog/index.c                   | 312 +++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 ++++++++---
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  31 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 +
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 21 files changed, 1193 insertions(+), 402 deletions(-)
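
Note for readers of the validation changes below: the merge step in heapam_index_validate_tuplesort_difference amounts to a set difference over two sorted TID streams (aux minus main). The following is only a minimal standalone sketch of that idea, assuming plain sorted int64 arrays in place of tuplesorts. The names `tid_encode` and `sorted_difference` are hypothetical and are not functions from this patch or from PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): pack (block, offset) into one
 * integer so that integer order matches TID order.
 */
static int64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return ((int64_t) block << 16) | offset;
}

/*
 * Copy every element of the sorted array aux[] that does not appear in the
 * sorted array main_tids[] into out[], preserving order.  out[] must have
 * room for naux elements.  Returns the number of elements written.
 */
static size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
				  const int64_t *aux, size_t naux,
				  int64_t *out)
{
	size_t		m = 0;
	size_t		nout = 0;

	assert(out != NULL || naux == 0);

	for (size_t a = 0; a < naux; a++)
	{
		/* Advance the main cursor until it reaches or passes aux[a]. */
		while (m < nmain && main_tids[m] < aux[a])
			m++;

		/* Main is exhausted or already past aux[a]: aux[a] is missing. */
		if (m == nmain || main_tids[m] > aux[a])
			out[nout++] = aux[a];
	}

	return nout;
}
```

Because both inputs are sorted, each cursor only moves forward, so the pass is linear in the combined input size. In the patch itself the surviving TIDs are then checked against a fresh snapshot before insertion, which this sketch does not model.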

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 16646f560e8..be2d3d5a6db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6253,6 +6253,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6293,13 +6305,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index with the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6316,8 +6327,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..6a05620bd67 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,11 +368,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index fa582d3e2d6..2e3e8a678c9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,450 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * the TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Now take the snapshot that will be used to filter candidate
+	 * tuples.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
+
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 44e5bc30d3e..8df0b472e88 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +748,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +760,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +797,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1407,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1462,7 +1473,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1484,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2468,7 +2629,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2528,7 +2690,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3284,12 +3447,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that we build the auxiliary index.  It is a fast operation without
+ * any actual table scan.  As a result, we have an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  At that point all new tuples are going
+ * to be inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3299,18 +3471,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3318,12 +3493,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that we need to do the
+ * bulkdelete in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3341,22 +3518,27 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, some actions to concurrently drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3389,12 +3571,16 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
 
 	/* mark build is concurrent just for consistency */
@@ -3413,15 +3599,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3444,27 +3645,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3473,8 +3680,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3533,6 +3744,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3804,6 +4020,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4046,6 +4269,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4071,6 +4295,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..ad15db57fd8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1274,16 +1274,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4b50d6ee8cf..51db7f23378 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,10 +588,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -833,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1257,7 +1272,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1599,6 +1615,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1627,11 +1653,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1641,7 +1667,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1680,7 +1706,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1692,15 +1718,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now call build on the auxiliary index. It is created empty, without any
+	 * actual heap scan, but marked as "ready-for-inserts". Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1728,43 +1778,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1787,12 +1825,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1817,6 +1855,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3537,6 +3622,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3642,8 +3728,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3695,8 +3788,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3757,6 +3857,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3860,15 +3967,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3919,6 +4029,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3932,12 +4047,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3946,6 +4066,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3964,10 +4085,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4048,13 +4173,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because it is created empty, without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4097,24 +4264,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4129,13 +4324,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4147,16 +4335,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4176,7 +4356,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4266,14 +4446,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4298,6 +4478,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4311,11 +4513,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4335,6 +4537,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 2d0c7a53563..a53779ae2aa 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -787,7 +787,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -803,6 +803,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -828,7 +829,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b1920999f12..1b2ef8f8002 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -715,11 +715,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1863,22 +1863,25 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates in
+ * both state and auxstate.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..01f85e57ea2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8c0ad96e02c..4826c1a5538 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -187,8 +187,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index bd5f002cf20..34362e3d875 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3049,6 +3050,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3061,8 +3063,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3090,6 +3094,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..09cfe799efa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,14 +2020,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index be570da08a0..fcff5d19998 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1250,10 +1251,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1265,6 +1268,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v16-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 5-v16-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From 1244e33a5e24b4c34529e7a5e5028174480aae49 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v16 11/12] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, because of an index
constraint violation), such indexes become junk indexes that serve no purpose.
This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)
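
Illustrative note, not part of the patch: the lookup that get_auxiliary_index()
performs further down can be modelled without any backend code. The sketch
below is a toy model only. ToyDepend, ToyRelKind, ToyDepType, toy_catalog and
toy_get_auxiliary_index are hypothetical names invented for this note, and the
array stands in for pg_depend. It assumes, as the patch comment does, that the
first AUTO dependency held by an index on the given index identifies its
auxiliary index.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for backend types; these are NOT the real definitions. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef enum { TOY_DEP_NORMAL, TOY_DEP_AUTO } ToyDepType;
typedef enum { TOY_KIND_INDEX, TOY_KIND_OTHER } ToyRelKind;

/* One row of a toy dependency table: "objid depends on refobjid". */
typedef struct ToyDepend
{
	Oid			objid;		/* dependent object */
	ToyRelKind	objkind;	/* kind of the dependent object */
	Oid			refobjid;	/* referenced object */
	ToyDepType	deptype;	/* dependency type */
} ToyDepend;

/* Hypothetical catalog contents: index 100 has auxiliary index 101. */
static const ToyDepend toy_catalog[] = {
	{101, TOY_KIND_INDEX, 100, TOY_DEP_AUTO},	/* aux index -> main index */
	{300, TOY_KIND_OTHER, 200, TOY_DEP_AUTO},	/* non-index dependent */
	{201, TOY_KIND_INDEX, 200, TOY_DEP_NORMAL}	/* index, but not AUTO */
};

#define TOY_NDEPS (sizeof(toy_catalog) / sizeof(toy_catalog[0]))

/*
 * Return the first index holding an AUTO dependency on indexId,
 * or InvalidOid if no such row exists.
 */
static Oid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, Oid indexId)
{
	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == indexId &&
			deps[i].deptype == TOY_DEP_AUTO &&
			deps[i].objkind == TOY_KIND_INDEX)
			return deps[i].objid;
	}
	return InvalidOid;
}
```

Calling toy_get_auxiliary_index(toy_catalog, TOY_NDEPS, 100) finds index 101,
while the same call for index 200 finds nothing, because neither of its
dependents is both an index and an AUTO dependency.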

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 64c633e0398..c6db5d57167 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 39d2f474865..c9eaa169274 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -687,6 +687,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -733,6 +735,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -775,6 +778,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1176,6 +1181,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1458,6 +1472,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1608,6 +1623,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3832,6 +3848,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3888,6 +3905,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4176,7 +4206,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4265,13 +4296,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4297,18 +4345,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 05a63e21475..782aaffa7bc 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1254,7 +1254,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3588,6 +3588,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3936,6 +3937,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3943,6 +3945,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4005,12 +4008,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4020,6 +4028,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4040,10 +4049,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4200,7 +4217,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4219,6 +4237,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is also marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4401,6 +4422,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4446,6 +4469,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 59156a1c1f6..df152c8466d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1491,6 +1491,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1551,9 +1553,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1605,6 +1618,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop must run as the first transaction. But if the
+		 * index has a junk auxiliary index, we want to drop that too (also
+		 * concurrently). In that case, perform a silent internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to
+		 * do this in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1633,12 +1674,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 01f85e57ea2..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 34362e3d875..8aa6815b37c 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3117,20 +3117,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index fcff5d19998..5e5cf23d97d 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1279,11 +1279,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0
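
Note on the junk auxiliary index handling in the patch above: each per-index step in ReindexRelationConcurrently() acts on auxIndexId and then, only when OidIsValid(junkAuxIndexId), on the junk index as well. The following is a minimal standalone sketch of that guard pattern, not code from the patch. ToyOid, TOY_INVALID_OID, ToyReindexInfo and toy_collect_aux_drops are hypothetical names used only for illustration; the sketch assumes an invalid OID is represented as 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Oid and InvalidOid. */
typedef unsigned int ToyOid;
#define TOY_INVALID_OID ((ToyOid) 0)

/* Hypothetical, reduced version of the per-index bookkeeping struct. */
typedef struct ToyReindexInfo
{
	ToyOid		indexId;
	ToyOid		auxIndexId;
	ToyOid		junkAuxIndexId;	/* TOY_INVALID_OID when none was detected */
} ToyReindexInfo;

/*
 * Collect the auxiliary indexes to clean up: always the regular auxiliary
 * index, plus the junk one when its OID is valid.  Returns the number of
 * OIDs written to out[].
 */
size_t
toy_collect_aux_drops(const ToyReindexInfo *infos, size_t n,
					  ToyOid *out, size_t out_cap)
{
	size_t		count = 0;

	assert(out != NULL);
	for (size_t i = 0; i < n; i++)
	{
		assert(count < out_cap);
		out[count++] = infos[i].auxIndexId;

		/* Same guard as the patch: skip when no junk index exists. */
		if (infos[i].junkAuxIndexId != TOY_INVALID_OID)
		{
			assert(count < out_cap);
			out[count++] = infos[i].junkAuxIndexId;
		}
	}
	return count;
}
```

The caller sizes out[] for the worst case of two entries per index.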



  [application/octet-stream] v16-0012-Updates-index-insert-and-value-computation-logic.patch (2.2K, 6-v16-0012-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From 827d63ce910b5a7328d4c79dbc8480de60d4fef6 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v16 12/12] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since they are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)
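
Reviewer note (illustration only, not part of the patch): the two changes can be modelled in a standalone sketch. For an auxiliary index, every column is reported as NULL without any per-column work, and the indexUnchanged hint short-circuits to false before the more expensive comparison runs. All Toy/toy_ names below are hypothetical stand-ins, and the two counters exist only to make the skipped work observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for Datum and IndexInfo. */
typedef uintptr_t ToyDatum;

typedef struct ToyIndexInfo
{
	int			num_attrs;		/* plays the role of ii_NumIndexAttrs */
	bool		auxiliary;		/* plays the role of ii_Auxiliary */
} ToyIndexInfo;

int			toy_compute_calls = 0;	/* per-column computations performed */
int			toy_compare_calls = 0;	/* "unchanged by update" checks performed */

/* Stand-in for evaluating one index column. */
ToyDatum
toy_compute_column(int attno)
{
	toy_compute_calls++;
	return (ToyDatum) (attno + 100);
}

/* Auxiliary indexes get all-NULL values; others compute each column. */
void
toy_form_index_datum(const ToyIndexInfo *info, ToyDatum *values, bool *isnull)
{
	assert(info != NULL && values != NULL && isnull != NULL);

	if (info->auxiliary)
	{
		for (int i = 0; i < info->num_attrs; i++)
		{
			values[i] = (ToyDatum) 0;
			isnull[i] = true;
		}
		return;
	}

	for (int i = 0; i < info->num_attrs; i++)
	{
		values[i] = toy_compute_column(i);
		isnull[i] = false;
	}
}

/* Stand-in for the comparison; this toy version always reports "unchanged". */
bool
toy_index_unchanged_by_update(void)
{
	toy_compare_calls++;
	return true;
}

/* The && chain stops before the comparison when the index is auxiliary. */
bool
toy_index_unchanged_hint(const ToyIndexInfo *info, bool update)
{
	return update && !info->auxiliary && toy_index_unchanged_by_update();
}
```

Because of short-circuit evaluation, the comparison is never called for an auxiliary index, which is the effect the execIndexing.c hunk aims for.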

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c9eaa169274..7ac6d3af606 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2931,6 +2931,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f2a74b76465..eef1b35e68c 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -441,11 +441,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimisation.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v16-0010-Remove-PROC_IN_SAFE_IC-optimization.patch (21.4K, 7-v16-0010-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 704bc5d6ccdba0e5346329642c9755778fd11bec Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v16 10/12] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql
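
Reviewer note (illustration only, not part of the patch): the visible effect on WaitForOlderSnapshots() is that the exclusion mask loses one bit. The sketch below uses the bit values from the proc.h hunk and assumes a backend is skipped whenever any of its status flags is in the mask. That rule is a simplification, and the TOY_/toy_ names are hypothetical. After the patch no backend sets the 0x04 bit at all; the example keeps it only to show how a formerly "safe" builder is now treated like any other backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values copied from the proc.h hunk of this patch. */
#define TOY_PROC_IS_AUTOVACUUM	0x01
#define TOY_PROC_IN_VACUUM		0x02
#define TOY_PROC_IN_SAFE_IC		0x04	/* removed by this patch */

/* Exclusion masks used by the wait loop before and after the patch. */
#define TOY_MASK_BEFORE \
	(TOY_PROC_IS_AUTOVACUUM | TOY_PROC_IN_VACUUM | TOY_PROC_IN_SAFE_IC)
#define TOY_MASK_AFTER \
	(TOY_PROC_IS_AUTOVACUUM | TOY_PROC_IN_VACUUM)

/*
 * Assumed rule: a backend is skipped when any of its status flags is in
 * the exclusion mask; otherwise its snapshot must be waited for.
 */
bool
toy_must_wait_for(uint8_t status_flags, uint8_t exclude_mask)
{
	return (status_flags & exclude_mask) == 0;
}

/* Count the backends in flags[] that would have to be waited for. */
int
toy_count_waits(const uint8_t *flags, int n, uint8_t exclude_mask)
{
	int			waits = 0;

	assert(flags != NULL || n == 0);
	for (int i = 0; i < n; i++)
	{
		if (toy_must_wait_for(flags[i], exclude_mask))
			waits++;
	}
	return waits;
}
```

With one ordinary backend, one lazy vacuum, one former "safe" index builder and one autovacuum worker, the old mask waits for one backend and the new mask waits for two.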

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccf74c0e1b6..1a0f7d13ece 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2891,11 +2891,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set for the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index fb4c4a31c74..49a57493340 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2094,11 +2094,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set for the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index bb4e0fbb675..4336cdb756c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set for the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 85f83a97a1f..05a63e21475 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1187,10 +1180,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1677,10 +1666,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1735,9 +1720,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1767,10 +1749,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1796,9 +1774,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1815,9 +1791,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1858,10 +1831,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1882,10 +1851,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3625,7 +3590,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3997,17 +3961,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4067,7 +4020,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4160,11 +4112,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4195,10 +4142,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4207,11 +4150,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4236,10 +4174,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4259,11 +4193,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4284,10 +4213,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4320,10 +4245,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4351,9 +4272,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4375,13 +4293,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4437,12 +4348,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4504,12 +4409,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4769,36 +4668,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 114eb1f8f76..7f6a9ccf126 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v16-0007-tuplestore-add-support-for-storing-Datum-values.patch (17.3K, 8-v16-0007-tuplestore-add-support-for-storing-Datum-values.patch)
  download | inline diff:
From e6cd266ab480c139cdf69103e7e8f3b18326e93a Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v16 07/12] tuplestore: add support for storing Datum values

Add the ability to store and retrieve individual Datum values in tuplestore, optimizing storage based on type:

- Fixed-length: stores raw bytes without a length prefix
- Variable-length: includes a length prefix (and a suffix for backward scans)
- By-value types are handled inline

This extends tuplestore beyond tuples; it is intended for use by the next patch in the series.
---
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)
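
The storage rules listed in the commit message above can be sketched outside of PostgreSQL as follows. This is only an illustration under stated assumptions, not the patch's code: the `Tape` type and the `tape_*`, `put_*` and `get_varlen` helpers are hypothetical stand-ins for a BufFile and the writetup/readtup callbacks, and the length word is assumed to hold the payload length only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for a BufFile. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

static void
tape_read(Tape *t, void *dst, size_t n)
{
	assert(t->rpos + n <= t->wpos);
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
}

/* Fixed-length value: raw bytes only, no length word on tape. */
static void
put_fixed(Tape *t, const void *val, size_t typlen)
{
	tape_write(t, val, typlen);
}

/*
 * Variable-length value: leading length word, payload, and a trailing
 * length word only when backward scans are requested.
 */
static void
put_varlen(Tape *t, const void *val, unsigned int len, int backward)
{
	tape_write(t, &len, sizeof(len));
	tape_write(t, val, len);
	if (backward)
		tape_write(t, &len, sizeof(len));
}

/* Read one variable-length value into 'out'; returns the payload length. */
static unsigned int
get_varlen(Tape *t, void *out, size_t cap, int backward)
{
	unsigned int len;

	tape_read(t, &len, sizeof(len));
	assert(len <= cap);
	tape_read(t, out, len);
	if (backward)
	{
		unsigned int trailer;

		/* skip the trailing length word and check it matches */
		tape_read(t, &trailer, sizeof(trailer));
		assert(trailer == len);
	}
	return len;
}
```

A caller would write with put_fixed() or put_varlen() and read back in the same order; the reader of a fixed-length value already knows its size, which is why no length word is needed for it.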

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index d61b601053c..03434f3ea49 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* typbyval of that type */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple from tape. Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, the first "unsigned int" of a stored tuple is the total
+ * size on-tape of the tuple, including itself (so it is never zero).
+ * The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a constant-length Datum, both "unsigned int" words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (unless it is omitted, as for a constant-length Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for single Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot but for single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index ed7c454f44e..1f431863387 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v16-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.0K, 9-v16-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From 205008ee146ac36801b9810331226a99448027de Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v16 05/12] Allow snapshot resets in concurrent unique index
 builds

 Previously, concurrent unique index builds used a fixed snapshot for the entire
 scan to ensure proper uniqueness checks. This could delay vacuum's ability to
 clean up dead tuples.

 Now reset snapshots periodically during concurrent unique index builds, while
 still maintaining uniqueness by:

 1. Ignoring dead tuples during uniqueness checks in tuplesort
 2. Adding a uniqueness check in _bt_load that detects multiple live tuples with the same key values

 This improves vacuum effectiveness during long-running index builds without
 compromising index uniqueness enforcement.
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 263 insertions(+), 93 deletions(-)
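
The idea behind the two points in the commit message above can be sketched in isolation: walk the key-sorted entries, skip dead ones, and report a violation only when two live entries share a key. This is a minimal sketch under stated assumptions, not the code in _bt_load or tuplesort: the `Entry` type and `has_live_duplicate` function are hypothetical, and a single int stands in for the index key columns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sorted index entry. */
typedef struct Entry
{
	int			key;			/* stands in for the index key columns */
	bool		alive;			/* false for a dead heap tuple */
} Entry;

/*
 * Return true if 'entries' (sorted by key) contains two live entries with
 * the same key.  Dead entries never count toward a violation.
 */
static bool
has_live_duplicate(const Entry *entries, size_t n)
{
	bool		have_live = false;
	int			live_key = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!entries[i].alive)
			continue;			/* ignore dead tuples entirely */
		if (have_live && entries[i].key == live_key)
			return true;		/* second live entry with the same key */
		have_live = true;
		live_key = entries[i].key;
	}
	return false;
}
```

Because the input is sorted, all entries with equal keys are adjacent, so remembering only the most recent live key is enough.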

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 22274f095ac..fa582d3e2d6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index cbe73675f86..5db6d237c2c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 810f80fc8e6..f7914ebb3d0 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is unique and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * In a concurrent build, unique checks must ignore dead tuples.
+	 * This is required because the snapshot is reset periodically.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples need not be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot
+	 * during the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool holds at most one
+	 * alive tuple per key value. Duplicates can remain in the spool if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet tested */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 693e43c674b..f9695fba8b5 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -51,8 +51,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2828,7 +2826,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -2946,17 +2944,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as a comparison function during a concurrent
+ * unique index build in case _bt_keep_natts_fast is not suitable because the
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set when any examined attribute is NULL.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -2982,6 +2987,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -3001,7 +3008,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -3012,7 +3019,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -3021,6 +3029,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, *hasnulls is set when any examined attribute is NULL.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -3029,7 +3039,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -3046,6 +3057,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 482d9a1786d..e369ad0b723 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3301,9 +3301,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 36b875945d3..4b50d6ee8cf 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1700,8 +1700,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb8601e2257..18f90d46a73 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -359,6 +361,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -401,6 +404,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1655,6 +1659,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1664,18 +1669,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e4fdeca3402..d22a9797ad0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1314,8 +1314,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 313394d92c6..b1920999f12 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1800,9 +1800,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * Snapshots then change on the fly, which lets the xmin horizon advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/octet-stream] v16-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.0K, 10-v16-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
  download | inline diff:
From a78a10d9d8bce018fc178ef9e3d787fb627daed8 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v16 06/12] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3b91d02605a..134636c4cc9 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3074,6 +3074,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3125,6 +3126,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that the strategy number is allowed for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed; if not, clean up and exit */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);	/* not held during insert */
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can be built only as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, disable further inserts and warn that the index should be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, it seems this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e369ad0b723..44e5bc30d3e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3411,6 +3411,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b5fbdcbd82..9ab60f37570 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index dbbc2f1e30d..2d0c7a53563 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -828,6 +828,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1be8739573f..44f8a0d5606 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 43445cdcc6c..26ddd5ec577 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 134b3dd8689..ac1e7e7c7f2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a323fa98bbb..8c0ad96e02c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -182,12 +182,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,6 +217,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 6543e90de75..fcd8a7c556f 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5136,7 +5136,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5150,7 +5151,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5175,9 +5177,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5186,12 +5188,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5200,7 +5203,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v16-0004-Allow-snapshot-resets-during-parallel-concurrent.patch (41.5K, 11-v16-0004-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From 18b28c955ff0e862caae8c54d2cfbc0935fdf50d Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v16 04/12] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with the scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 08dc35dd8df..ccf74c0e1b6 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1218,7 +1217,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1251,7 +1249,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1266,6 +1263,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2366,7 +2364,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2397,25 +2394,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2455,8 +2452,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2481,7 +2476,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2527,7 +2523,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2543,6 +2538,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2551,7 +2553,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2574,9 +2577,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2776,14 +2776,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2805,6 +2805,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2945,6 +2946,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index f6f40c2f53f..fb4c4a31c74 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
 static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1778,14 +1778,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1808,6 +1808,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2167,6 +2168,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f17c5dbacaa..22274f095ac 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b490da0eeee..810f80fc8e6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent non-unique build, snapshots are reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..277c79dd554 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly and use the active snapshot as the initial
+	 * state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report
+		 * whether each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, a worker is considered attached only after it
+ * has restored its snapshot. This is needed with periodic snapshot resets so
+ * that all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0a153c6f746..482d9a1786d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 6f9e991eeae..bc639964ada 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3d018c3a1e8..4cd536e988c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index dc6e0184284..8529b808aed 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5393b30c57e..313394d92c6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1181,7 +1181,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1799,9 +1800,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. This changes snapshots on the fly, which allows the xmin
+ * horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v16-0002-Add-stress-tests-for-concurrent-index-operations.patch (8.1K, 12-v16-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From e5bbf8457ce5947616cfaab2dc59d1099ba58b3c Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v16 02/12] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 190 ++++++++++++++++++++++++++++++++
 2 files changed, 191 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..0d755373ee4
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,190 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required by the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN/GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 4)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIN (ia);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIST (p);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING BRIN (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING HASH (updated_at);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v16-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch (46.8K, 13-v16-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch)
  download | inline diff:
From 585c27c1260b7d26c5357933face681a41371804 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v16 03/12] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
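
For illustration only, the effect of resetting the snapshot every N pages can be shown with a toy model. This is a rough sketch and not code from the patch: toy_run_scan, ToyScanResult and TOY_RESET_EACH_N_PAGES are hypothetical names, the value 4 is arbitrary, and a plain counter stands in for the transaction ID horizon that a real snapshot would pin.

```c
#include <stdbool.h>

/* Hypothetical stand-in for SO_RESET_SNAPSHOT_EACH_N_PAGE; the value is arbitrary. */
#define TOY_RESET_EACH_N_PAGES 4

typedef struct ToyScanResult
{
	unsigned	final_xmin;		/* xmin still pinned when the scan ends */
	int			resets;			/* number of snapshot resets performed */
} ToyScanResult;

/*
 * Simulate a heap scan of npages pages.  One transaction ID is consumed per
 * page to model concurrent activity.  When reset_snapshot is true (the
 * analogue of SO_RESET_SNAPSHOT), the pinned xmin is moved forward every
 * TOY_RESET_EACH_N_PAGES pages; otherwise it stays at start_xid throughout.
 */
static ToyScanResult
toy_run_scan(int npages, bool reset_snapshot, unsigned start_xid)
{
	ToyScanResult result = {start_xid, 0};
	unsigned	next_xid = start_xid;
	int			pages_since_reset = 0;

	for (int page = 0; page < npages; page++)
	{
		next_xid++;				/* concurrent transactions keep arriving */

		if (!reset_snapshot)
			continue;			/* a single snapshot pins xmin for the whole scan */

		if (++pages_since_reset >= TOY_RESET_EACH_N_PAGES)
		{
			/* take a fresh snapshot, so the pinned xmin can advance */
			result.final_xmin = next_xid;
			pages_since_reset = 0;
			result.resets++;
		}
	}

	return result;
}
```

In this model a 10-page scan starting at xid 100 performs two resets and ends with xmin 108 when resets are enabled, while without resets xmin stays at 100 for the whole scan. That pinned value is what holds back VACUUM in the long-running case.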
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index aac8c74f546..63a08fbe615 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 75a65ec9c75..08dc35dd8df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1213,11 +1213,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1230,6 +1231,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1249,6 +1251,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2372,6 +2375,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2397,9 +2401,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2442,6 +2453,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2521,6 +2534,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2537,6 +2552,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b2f89cad880..f6f40c2f53f 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 4c83b09edde..0bc93d86460 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -196,6 +196,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fa7935a0ed3..def4fe20d1e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -570,6 +571,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -611,7 +642,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1256,6 +1292,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e78682c3cef..f17c5dbacaa 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * For a unique index we need a consistent snapshot for the whole scan.
+	 * A parallel scan requires some additional infrastructure to run
+	 * with SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed because the snapshot is going to be replaced every
+			 * so often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* store link to snapshot because it may be copied */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 07bae342e25..0d262a4188d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db90..b490da0eeee 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8e1741c81f5..0a153c6f746 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3214,7 +3231,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3277,12 +3295,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to propagate the xmin horizon forward.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6a72e566d4a..36b875945d3 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1700,23 +1700,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using single or
+	 * multiple refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4079,9 +4073,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4096,7 +4087,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36ee6dd43de..e0d82d17918 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6789,6 +6790,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6844,6 +6846,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6901,6 +6908,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..f5bb04d5bd1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 131c050c15f..5393b30c57e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon propagating forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -936,7 +948,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -944,6 +957,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1776,6 +1798,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index built by a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. That changes snapshots
+ * on the fly to allow the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v16-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (17.5K, 14-v16-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From c5395363e7061de275eb2ad359bc488e4243f71d Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v16 01/12] This is https://commitfest.postgresql.org/50/5160/
 merged into a single commit. It is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 6 files changed, 216 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 32ff3ca9a28..6a72e566d4a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1796,6 +1796,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4201,7 +4202,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4280,6 +4281,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 742f3f8c08d..f2a74b76465 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -943,6 +944,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 5cd5e2eeb80..df2420ce8ab 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -487,6 +487,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported by this check. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -697,6 +739,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -707,23 +751,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for any additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index b0fe50075ad..d5ad73f6f69 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1158,6 +1159,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 71abb01f655..af7586a428f 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare the requirements that other indexes must meet to be used as
+					 * arbiters together with indexOidFromConstraint. Both equivalent indexes
+					 * must be involved in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * When conventional inference is involved, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8f1508b1ee2..3d018c3a1e8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0
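
For readers following the arbiter logic in the patch above, here is a small standalone model of the compatibility check. It is only a sketch under stated assumptions, not the patch's code: `ModelIndex` and `model_compatible_as_arbiter` are hypothetical names, and expressions and predicates are compared as normalized text, whereas the patch uses `list_difference()` on the real expression trees.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_KEYS 4

/* Hypothetical, simplified description of an index (not a PostgreSQL type). */
typedef struct ModelIndex
{
	bool		unique;
	bool		exclusion;		/* backs an exclusion constraint? */
	int			nkeyatts;
	int			keyattrs[MODEL_MAX_KEYS];	/* 0 marks an expression column */
	const char *exprs;			/* expressions as normalized text, "" if none */
	const char *pred;			/* predicate as normalized text, "" if none */
} ModelIndex;

/*
 * Return true if the candidate could act as an arbiter alongside the
 * given arbiter: same uniqueness, no exclusion constraints, same key
 * columns in the same order, same expressions and same predicate.
 */
static bool
model_compatible_as_arbiter(const ModelIndex *arbiter,
							const ModelIndex *candidate)
{
	int			i;

	assert(arbiter != NULL && candidate != NULL);
	assert(arbiter->nkeyatts <= MODEL_MAX_KEYS);
	assert(candidate->nkeyatts <= MODEL_MAX_KEYS);

	if (arbiter->unique != candidate->unique)
		return false;
	if (arbiter->exclusion || candidate->exclusion)
		return false;
	if (arbiter->nkeyatts != candidate->nkeyatts)
		return false;

	for (i = 0; i < arbiter->nkeyatts; i++)
	{
		if (arbiter->keyattrs[i] != candidate->keyattrs[i])
			return false;
	}

	/* Simplification: text equality instead of expression-tree comparison. */
	if (strcmp(arbiter->exprs, candidate->exprs) != 0)
		return false;
	return strcmp(arbiter->pred, candidate->pred) == 0;
}
```

In this model, the transient copy of an index created by REINDEX CONCURRENTLY would match its original on every field, so both would be treated as arbiters.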




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-04-06 23:45  Mihail Nikalayeu <[email protected]>
  parent: Michail Nikolaev <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Mihail Nikalayeu @ 2025-04-06 23:45 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased and updated to account for some changes.

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v17-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.4K, 3-v17-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From 53b2ebc56565017d4c8212ebf6a5773c77c99a04 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v17 05/12] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire scan to ensure proper uniqueness checks. This could delay vacuum's ability to clean up dead tuples.

Snapshots are now reset periodically during concurrent unique index builds, while uniqueness is still maintained by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without compromising index uniqueness enforcement.
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 264 insertions(+), 94 deletions(-)
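
The two-part uniqueness scheme described in the commit message above can be illustrated with a small standalone model. This is only a sketch, not the patch's code: `ModelTuple` and `has_live_duplicate` are hypothetical names, and a plain boolean `alive` flag stands in for the real visibility check. It assumes the tuples arrive sorted by key, as they would when read back from the sort, and reports a violation only when two alive tuples share a key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple: one integer key plus liveness. */
typedef struct ModelTuple
{
	int			key;
	bool		alive;			/* stand-in for a real visibility check */
} ModelTuple;

/*
 * Scan tuples sorted by key and return true if any key has more than
 * one alive tuple.  Dead tuples never count toward a violation.
 */
static bool
has_live_duplicate(const ModelTuple *tuples, size_t ntuples)
{
	bool		have_alive = false; /* alive tuple seen in current key group? */
	size_t		i;

	for (i = 0; i < ntuples; i++)
	{
		/* Input must be sorted by key. */
		assert(i == 0 || tuples[i - 1].key <= tuples[i].key);

		/* A new key group starts: forget what was seen in the previous one. */
		if (i > 0 && tuples[i - 1].key != tuples[i].key)
			have_alive = false;

		if (!tuples[i].alive)
			continue;

		if (have_alive)
			return true;		/* second alive tuple with the same key */
		have_alive = true;
	}
	return false;
}
```

A key group such as (1, dead), (1, alive) passes in this model, while (1, alive), (1, dead), (1, alive) is reported as a violation.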

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7273b1aee00..0eaa4df5582 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 08884116aec..347b50d6e51 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2f45ae96c0c..d186ce9ec37 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * In a concurrent build, dead tuples must be ignored by the uniqueness check.
+	 * This is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool holds at most one
+	 * alive tuple per key value.  Duplicate alive tuples may go undetected when
+	 * they are separated by a few dead tuples in sort order, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet tested */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1320,7 +1432,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1417,7 +1529,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,21 +1546,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1457,16 +1559,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1536,6 +1638,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1550,7 +1653,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1630,7 +1733,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1641,7 +1744,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1744,6 +1847,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1847,11 +1951,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1931,6 +2036,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1953,14 +2059,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9e27302fe81..8d7f5905fc2 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -66,8 +66,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2515,7 +2513,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3805,7 +3803,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3923,17 +3921,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as a comparison function during concurrent
+ * unique index builds, in case _bt_keep_natts_fast is not suitable because
+ * the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true if any compared attribute is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -3959,6 +3964,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -3978,7 +3985,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -3989,7 +3996,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -3998,6 +4006,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true if any compared attribute is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4006,7 +4016,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4023,6 +4034,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6432ef55cdc..cca1dbb8e37 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often.  The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d687646efed..778d9528c25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,8 +1694,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 5f70e8dddac..71a5c21e0df 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -358,6 +360,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -400,6 +403,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1653,6 +1657,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1662,18 +1667,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..38471e90a0c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1339,8 +1339,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 387c308ec2f..5182013aabd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * The snapshot is then replaced on the fly so the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/octet-stream] v17-0002-Add-stress-tests-for-concurrent-index-operations.patch (9.2K, 4-v17-0002-Add-stress-tests-for-concurrent-index-operations.patch)
  download | inline diff:
From 08102f2f88c1d48a466e688fc388bda4a95ed876 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v17 02/12] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- needed because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v17-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch (46.8K, 5-v17-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch)
  download | inline diff:
From d51069c170c37445fc0ea30249a60d43c4f0929e Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v17 03/12] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only during the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..d69b658ef20 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -565,7 +565,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 0d9c2b0b653..a6dad54ff58 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f856..e5a945a1b14 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e25d817c195..0d5792ff7ff 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed2e3021799..4d28070a210 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -612,6 +613,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases this is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -653,7 +684,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1304,6 +1340,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..8a584db595a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, and that infrastructure is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* the push may copy the snapshot, so point at the active copy */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 8f532e14590..42921020316 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924ad..f3986d086b6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1409,6 +1419,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1434,9 +1445,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1490,6 +1508,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1584,6 +1604,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1600,6 +1622,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..cbd0ba9aa01 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon move forward.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index bb0155fdc24..d687646efed 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,23 +1694,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index from all tuples visible to a single snapshot or to
+	 * several periodically refreshed ones.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4073,9 +4067,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4090,7 +4081,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 566ce5b3cb4..445ca375e9e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6853,6 +6854,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6908,6 +6910,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6965,6 +6972,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..6caad42ea4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..7e8fa5e1b57 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a fresh snapshot is pushed as active.
+	 *
+	 * The snapshot is not popped at the end of the scan.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* The active snapshot must not be registered, so that xmin can advance. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots are then replaced
+ * on the fly, which allows the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
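
The conditions under which the patch above asks for snapshot resets can be
summarised in a small standalone model. This is only a sketch for readers, not
code from the patch: `BuildParams`, `wants_reset_snapshots()` and
`scan_gets_so_reset_snapshot()` are hypothetical names, and the `parallel_scan`
field is an assumed simplification of "a parallel scan descriptor was passed
in". The boolean expression itself follows `reset_snapshots` in
`heapam_index_build_range_scan()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inputs the patch consults; not a PostgreSQL type. */
typedef struct BuildParams
{
	bool		concurrent;		/* models indexInfo->ii_Concurrent */
	bool		unique;			/* models indexInfo->ii_Unique */
	bool		system_catalog;	/* models is_system_catalog */
	bool		parallel_scan;	/* assumed: true when a parallel scan is handed in */
} BuildParams;

/* Mirrors the reset_snapshots expression in heapam_index_build_range_scan(). */
static bool
wants_reset_snapshots(BuildParams p)
{
	return p.concurrent && !p.unique && !p.system_catalog;
}

/*
 * In the patch the flag is passed to table_beginscan_strat() only on the
 * serial path, so SO_RESET_SNAPSHOT ends up limited to serial builds.
 */
static bool
scan_gets_so_reset_snapshot(BuildParams p)
{
	bool		reset = wants_reset_snapshots(p);

	/* A non-concurrent build never asks for resets. */
	assert(p.concurrent || !reset);

	return reset && !p.parallel_scan;
}
```

This matches what cic_reset_snapshots.out expects: with parallel_workers=0 the
`table_beginscan_strat_reset_snapshots` notice appears for the non-unique
builds and not for `CREATE UNIQUE INDEX CONCURRENTLY`.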



  [application/octet-stream] v17-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (17.5K, 6-v17-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From cb883df416d52c5a151a8a25ba4d593735075d72 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v17 01/12] This is https://commitfest.postgresql.org/50/5160/
 merged into a single commit. It is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 6 files changed, 216 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 33c2106c17c..bb0155fdc24 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1790,6 +1790,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4195,7 +4196,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4274,6 +4275,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index e3fe9b78bb5..55491c7b8b1 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -943,6 +944,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 0374476ffad..6e74c9b7e87 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -487,6 +487,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -697,6 +739,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -707,23 +751,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * However, the additional arbiter indexes must be taken into account too.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 309e27f8b5f..3f4aee5ba3b 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1158,6 +1159,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 67d879be8b8..215f9786469 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * Indexes are opened in list order to keep the lock order unchanged and avoid deadlocks.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare the requirements other indexes must meet to be used as arbiters
+					 * together with indexOidFromConstraint. Both equivalent indexes must be
+					 * involved during REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * With conventional inference, a partial index's predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* With a named constraint, a partial index's predicate must equal that of the constraint's index. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required at planning time. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..dcfe16a9824 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -447,6 +448,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0
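
The planner-side rule from the plancat.c hunk above (collect every matching index that is indisready, then fail unless at least one of them is also indisvalid) can be modelled in isolation. The sketch below is a toy under stated assumptions: ToyCandidate, its matches flag and toy_infer_arbiters are hypothetical names, and the real attribute, expression and predicate matching is collapsed into a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one index on the target relation. */
typedef struct ToyCandidate
{
    unsigned    oid;
    bool        indisready;     /* index already receives inserts */
    bool        indisvalid;     /* index is usable by queries */
    bool        matches;        /* stands in for all matching checks */
} ToyCandidate;

/*
 * Writes the OIDs of all ready, matching indexes to "out" (which must have
 * room for n entries) and returns how many were written.  Returns -1 when
 * nothing matched or when none of the matches is valid, which models the
 * ereport(ERROR) at the end of infer_arbiter_indexes.
 */
static int
toy_infer_arbiters(const ToyCandidate *cands, size_t n, unsigned *out)
{
    int         nfound = 0;
    bool        found_valid = false;

    assert(cands != NULL && out != NULL);

    for (size_t i = 0; i < n; i++)
    {
        if (!cands[i].indisready || !cands[i].matches)
            continue;
        out[nfound++] = cands[i].oid;
        found_valid = found_valid || cands[i].indisvalid;
    }

    return (nfound == 0 || !found_valid) ? -1 : nfound;
}
```

With one valid index and one ready-but-not-yet-valid twin, both OIDs are returned, which corresponds to the stated goal of every concurrent transaction seeing the same arbiter set.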



  [application/octet-stream] v17-0004-Allow-snapshot-resets-during-parallel-concurrent.patch (41.5K, 7-v17-0004-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From c57ed5417ce444a2899cd611475816642d30296e Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v17 04/12] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)
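
To make the motivation in the commit message concrete, here is a deliberately simplified model of how periodic snapshot resets bound the distance between the newest transaction ID and the xmin a scanning backend holds back. It is a sketch, not PostgreSQL code: toy_max_xmin_lag is a hypothetical function, and it assumes the held xmin simply equals the newest XID at the moment the snapshot was taken, which ignores other running transactions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * xid_at_page[p] is the newest transaction ID in the system at the moment
 * page p is scanned (non-decreasing).  A snapshot is taken at page 0 and,
 * if reset_every is non-zero, retaken every reset_every pages.  The return
 * value is the largest gap observed between the newest XID and the xmin
 * held by the current snapshot.
 */
static unsigned
toy_max_xmin_lag(const unsigned *xid_at_page, size_t npages,
                 size_t reset_every)
{
    unsigned    held_xmin = 0;
    unsigned    max_lag = 0;

    assert(xid_at_page != NULL || npages == 0);

    for (size_t p = 0; p < npages; p++)
    {
        unsigned    lag;

        /* Take the first snapshot, or reset it on the configured period. */
        if (p == 0 || (reset_every != 0 && p % reset_every == 0))
            held_xmin = xid_at_page[p];

        lag = xid_at_page[p] - held_xmin;
        if (lag > max_lag)
            max_lag = lag;
    }

    return max_lag;
}
```

In this model a scan that never resets (reset_every = 0) holds back everything that happened since it started, while a scan that resets every two pages only ever lags by the XIDs consumed within two pages.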

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e5a945a1b14..423424e51a2 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d5792ff7ff..fe0b79a5fdd 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
 static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1778,14 +1778,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1808,6 +1808,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2167,6 +2168,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8a584db595a..7273b1aee00 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that; for a non-unique index,
+	 * that snapshot is reset periodically during the scan.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f3986d086b6..2f45ae96c0c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1420,6 +1417,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1437,12 +1435,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that; for a non-unique index, that snapshot may be
+	 * reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1450,6 +1457,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1510,7 +1522,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1537,7 +1549,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1613,6 +1626,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent non-unique build, snapshots are reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1621,7 +1641,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1645,7 +1666,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1895,6 +1916,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1949,11 +1971,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1989,4 +2015,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..277c79dd554 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly and use the currently active snapshot as the
+	 * initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate per-worker flags in dynamic shared memory that workers use
+		 * to report that they have restored their snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * If wait_for_snapshot is true, a worker counts as attached only after it has
+ * restored its snapshot. This is needed when using periodic snapshot resets to
+ * ensure all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cbd0ba9aa01..6432ef55cdc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 6f9e991eeae..bc639964ada 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index dcfe16a9824..580ac54856f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -342,14 +342,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7e8fa5e1b57..387c308ec2f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. Snapshots are then changed on the fly to allow the xmin horizon
+ * to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v17-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.0K, 8-v17-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
  download | inline diff:
From 8ca957043836f924c32783db9bb0ab7f610e62dc Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v17 06/12] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index a6dad54ff58..ca5214461e6 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad09..232c87ec267 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3092,6 +3092,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3143,6 +3144,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		if (metaData->skipInserts)	/* inserts disabled: clean up and exit */
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);	/* not held during the insert */
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can only be built as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark the index to skip inserts and warn that it must be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cca1dbb8e37..e9e22ec0e84 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..38602e6a72d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index dfbb4c85460..a121b4d31c9 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..e29cd431659 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5b6cadb5a6c..3850dde4adb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -182,12 +182,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,6 +217,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index cf48ae6d0c2..52dde57680d 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5137,7 +5137,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5151,7 +5152,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5176,9 +5178,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5187,12 +5189,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5201,7 +5204,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v17-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (110.0K, 9-v17-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
  download | inline diff:
From 890a662325aad4703b66305eda43bcd6266349c6 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v17 08/12] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
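For illustration only (this text sits below the "---" marker, so it is not
part of the commit message): the validation step described above reduces to
a set difference over two sorted TID streams. What follows is a minimal
standalone sketch of that merge, assuming plain sorted int64 arrays in place
of the patch's tuplesort and tuplestore. The name sorted_difference and its
parameters are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): copy into out[] every value that
 * is present in the sorted array aux[] but absent from the sorted array
 * main_tids[].  Values stand in for encoded TIDs.  out[] must have room for
 * naux entries.  Returns the number of values written.
 */
size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
                  const int64_t *aux, size_t naux,
                  int64_t *out)
{
    size_t      i = 0;          /* cursor into main_tids[] */
    size_t      n = 0;          /* number of values emitted */

    assert(out != NULL || naux == 0);

    for (size_t j = 0; j < naux; j++)
    {
        /* Advance the main cursor until it reaches or passes aux[j]. */
        while (i < nmain && main_tids[i] < aux[j])
            i++;

        /* Main exhausted or already past aux[j]: aux[j] is missing there. */
        if (i == nmain || main_tids[i] > aux[j])
            out[n++] = aux[j];
    }

    return n;
}
```

Both inputs are consumed in a single forward pass, so the cost is linear in
the combined number of entries; this is why both sides have to be sorted in
the same TID order first.
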
 doc/src/sgml/monitoring.sgml                  |  26 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/README.HOT            |  15 +-
 src/backend/access/heap/heapam_handler.c      | 592 ++++++++++++------
 src/backend/catalog/index.c                   | 312 +++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 ++++++++---
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  31 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 +
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 21 files changed, 1194 insertions(+), 402 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c421d89edff..bcf02d511c4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6305,6 +6305,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6345,13 +6357,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6368,8 +6379,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+        with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each actually entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..6a05620bd67 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,11 +368,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0eaa4df5582..156c429c1af 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,246 +1782,451 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		* Attempt to fetch the next TID from the auxiliary sort. If it's
+		* empty, we set auxindexcursor to NULL.
+		*/
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		* If the auxiliary sort is not yet empty, we now try to synchronize
+		* the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		* the main sort cursor until we've reached or passed the auxiliary TID.
+		*/
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number =  ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Now take the snapshot that will be used to filter candidate
+	 * tuples.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
+
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e9e22ec0e84..a93de2ba9a3 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -744,7 +749,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -755,11 +761,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +798,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1408,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1463,7 +1474,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1473,6 +1485,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2630,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3469,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that does not
+ * scan the table, so it leaves us with an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * also inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3493,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3515,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3540,27 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs
+	 * and 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,12 +3593,16 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
 
 	/* mark build is concurrent just for consistency */
@@ -3435,15 +3621,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3667,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,8 +3702,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3555,6 +3766,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4042,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4291,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4317,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..2593ff28689 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1275,16 +1275,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 778d9528c25..2c2ae798c57 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,10 +588,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -833,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1251,7 +1266,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1593,6 +1609,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of concurrent build - create auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1621,11 +1647,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1635,7 +1661,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1674,7 +1700,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1686,15 +1712,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but indexes are still
+	 * marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now build the auxiliary index. It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts". Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see
+	 * the auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1722,43 +1772,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1781,12 +1819,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1811,6 +1849,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3531,6 +3616,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3636,8 +3722,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3689,8 +3782,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3751,6 +3851,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3854,15 +3961,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3913,6 +4023,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3926,12 +4041,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3940,6 +4060,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3958,10 +4079,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4042,13 +4167,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index. This is fast: no heap scan is done, the index is just created empty. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4091,24 +4258,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts". So,
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4123,13 +4318,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4141,16 +4329,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4170,7 +4350,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4260,14 +4440,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4292,6 +4472,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4305,11 +4507,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4329,6 +4531,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5182013aabd..4cf51e946ed 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1817,22 +1817,25 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sortstates
+ * in both state and auxstate.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..01f85e57ea2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3850dde4adb..76f25ec686f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -187,8 +187,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..ca74844b5c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..e1474096aa6 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2037,14 +2037,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e21ff426519..2cff1ac29be 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v17-0007-tuplestore-add-support-for-storing-Datum-values.patch (17.3K, 10-v17-0007-tuplestore-add-support-for-storing-Datum-values.patch)
  download | inline diff:
From f2c8fda7492dd7ce9b9f9fb25c6d61dc8a8d5a7d Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v17 07/12] tuplestore: add support for storing Datum values

Add the ability to store and retrieve individual Datum values in a tuplestore, with the on-tape format chosen by type:

- Fixed-length: stores raw bytes without length prefix
- Variable-length: includes length prefix (and suffix for backward scans)
- By-value types handled inline

This extends tuplestore beyond storing tuples; it is planned to be used by the next patch in the series.
---
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)
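
The storage rules listed in the commit message can be pictured with a small standalone sketch. It is not the patch's code: `Tape`, `put_fixed`, `put_varlen` and `next_len` are hypothetical names, a fixed in-memory buffer stands in for the BufFile, and the length word is assumed to hold the payload size only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the temp file ("tape") behind a tuplestore. */
typedef struct Tape
{
    unsigned char data[256];
    size_t      used;
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
    assert(t->used + n <= sizeof(t->data));
    memcpy(t->data + t->used, src, n);
    t->used += n;
}

/* Fixed-length value: raw bytes only, no length words at all. */
static void
put_fixed(Tape *t, const void *value, size_t typlen)
{
    tape_write(t, value, typlen);
}

/*
 * Variable-length value: leading length word, payload, and a trailing
 * length word only when backward scans must be supported.
 */
static void
put_varlen(Tape *t, const void *payload, unsigned int len, int backward)
{
    tape_write(t, &len, sizeof(len));
    tape_write(t, payload, len);
    if (backward)
        tape_write(t, &len, sizeof(len));
}

/*
 * Length of the item starting at *pos, or 0 at end of tape.  For a
 * fixed-length type (typlen > 0) nothing is consumed; otherwise the
 * leading length word is read and *pos is advanced past it.
 */
static unsigned int
next_len(const Tape *t, size_t *pos, int typlen)
{
    unsigned int len;

    if (*pos >= t->used)
        return 0;
    if (typlen > 0)
        return (unsigned int) typlen;

    assert(*pos + sizeof(len) <= t->used);
    memcpy(&len, t->data + *pos, sizeof(len));
    *pos += sizeof(len);
    return len;
}
```

In this sketch a fixed-length item costs exactly `typlen` bytes on tape, while a variable-length item costs its payload plus one or two `unsigned int` words, which is the saving the fixed-length path is after.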

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..12ae705c091 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* whether that type is pass-by-value */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next stored tuple from tape. Used to
+	 * provide the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (unless it is omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap(), but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot(), but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *datum = palloc(len);
+		BufFileReadExact(state->myfile, datum, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return datum;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index ed7c454f44e..1f431863387 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
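
For readers following writetup_datum()/readtup_datum() above, here is a minimal standalone sketch of the record framing they rely on: a leading length word, the payload, and an optional trailing length word so the file can also be scanned backward. It is not the patch's code. It uses an in-memory byte buffer instead of a BufFile, and ByteBuf, buf_write, buf_read, frame_write and frame_read are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a BufFile; not a PostgreSQL type. */
typedef struct ByteBuf
{
    unsigned char data[256];
    size_t        used;     /* bytes written so far */
    size_t        pos;      /* current read position */
} ByteBuf;

static void
buf_write(ByteBuf *b, const void *src, size_t n)
{
    assert(b->used + n <= sizeof(b->data));
    memcpy(b->data + b->used, src, n);
    b->used += n;
}

static void
buf_read(ByteBuf *b, void *dst, size_t n)
{
    assert(b->pos + n <= b->used);
    memcpy(dst, b->data + b->pos, n);
    b->pos += n;
}

/* Write one record: length word, payload, optional trailing length word. */
static void
frame_write(ByteBuf *b, const void *payload, unsigned int len, int backward)
{
    buf_write(b, &len, sizeof(len));    /* the length is known before it is written */
    buf_write(b, payload, len);
    if (backward)
        buf_write(b, &len, sizeof(len));
}

/* Read one record into dst (capacity cap); returns the payload length. */
static unsigned int
frame_read(ByteBuf *b, void *dst, size_t cap, int backward)
{
    unsigned int len;
    unsigned int trailer;

    buf_read(b, &len, sizeof(len));
    assert(len <= cap);
    buf_read(b, dst, len);              /* read into the buffer, not into its address */
    if (backward)
    {
        buf_read(b, &trailer, sizeof(trailer));
        assert(trailer == len);         /* both length words must agree */
    }
    return len;
}
```

The writer and the reader have to use the same integer type for the length words; in this sketch both sides use unsigned int.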



  [application/octet-stream] v17-0012-Updates-index-insert-and-value-computation-logic.patch (2.2K, 11-v17-0012-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From b99a8a5daac028360a467a41d0ae33207c161a78 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v17 12/12] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since they are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 82f02df7430..343f74a40e8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2932,6 +2932,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 55491c7b8b1..7c302cafc81 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -441,11 +441,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v17-0009-Concurrently-built-index-validation-uses-fresh-s.patch (14.7K, 12-v17-0009-Concurrently-built-index-validation-uses-fresh-s.patch)
  download | inline diff:
From 2f33c094f05e31dfdfdecc915b8eceb4ccb2e8ef Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 17:21:29 +0100
Subject: [PATCH v17 09/12] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach of using a single reference snapshot could lead to issues with xmin propagation. Specifically, if the index build took a long time, the reference snapshot's xmin could become outdated, causing the index to miss tuples that were deleted by transactions that committed after the reference snapshot was taken.

To address this, the validation process now periodically replaces the snapshot with a newer one. This ensures that the index's xmin is kept up-to-date and that all relevant tuples are included in the index.
---
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/heapam_handler.c | 77 +++++++++++++++---------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 20 ++++--
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 +++++
 8 files changed, 104 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6a05620bd67..64c633e0398 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -495,10 +495,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
    </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 156c429c1af..32a66f3e6f6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1795,8 +1795,8 @@ heapam_index_build_range_scan(Relation heapRelation,
  */
 static int
 heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
-										   Tuplesortstate  *aux,
-										   Tuplestorestate *store)
+									  Tuplesortstate  *aux,
+									  Tuplestorestate *store)
 {
 	int				num = 0;
 	/* state variables for the merge */
@@ -2052,7 +2052,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2063,9 +2064,35 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
-	 * Now take the snapshot that will be used by to filter candidate
-	 * tuples.
+	 * sanity checks
+	 */
+	Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+															  auxState->tuplesort,
+															  tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We replace it with a newer snapshot every so often
+	 * to propagate the horizon.
 	 *
 	 * Beware!  There might still be snapshots in use that treat some transaction
 	 * as in-progress that our temporary snapshot treats as committed.
@@ -2081,33 +2108,10 @@ heapam_index_validate_scan(Relation heapRelation,
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
 	limitXmin = snapshot->xmin;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
-	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
-
-	/*
-	 * sanity checks
-	 */
-	Assert(OidIsValid(indexRelation->rd_rel->relam));
-
-	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
-														 auxState->tuplesort,
-														 tuples_for_check);
-
-	/* It is our responsibility to sloe tuple sort as fast as we can */
-	tuplesort_end(state->tuplesort);
-	tuplesort_end(auxState->tuplesort);
-
-	state->tuplesort = auxState->tuplesort = NULL;
-
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2145,6 +2149,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2199,6 +2204,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+
+		if (page_read_counter % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d186ce9ec37..8d755470e8c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 81171f35451..d721fa45a0c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -958,6 +959,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -998,6 +1003,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a93de2ba9a3..bfd6a0a37f0 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3507,8 +3507,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin, we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3521,7 +3522,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3604,6 +3605,7 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3639,6 +3641,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3648,6 +3653,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3667,9 +3675,11 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
-
-	PopActiveSnapshot();
+	/* tuplesort_performsort may require catalog snapshot */
 	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2c2ae798c57..4f9ccc9ca8d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4343,7 +4343,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0
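
To make the snapshot handling in heapam_index_validate_scan() above easier to follow, here is a small standalone simulation of the "replace the snapshot every N pages" loop. It is only a sketch under stated assumptions: Snap, take_snapshot and scan_with_refresh are hypothetical names, the reset interval is passed as a parameter because the value of SO_RESET_SNAPSHOT_EACH_N_PAGE is not shown in this patch, and the rule that the oldest running xid advances by 10 per page is invented purely so the effect is visible.

```c
#include <assert.h>

/* Hypothetical stand-in for a snapshot: only its xmin matters here. */
typedef struct Snap
{
    unsigned int xmin;
} Snap;

/* Invented rule for the simulation: the horizon advances by 10 per page read. */
static Snap
take_snapshot(unsigned int pages_done)
{
    Snap s;

    s.xmin = 100 + 10 * pages_done;
    return s;
}

/*
 * Scan npages pages, replacing the snapshot whenever the page counter is a
 * multiple of reset_every.  Returns the number of replacements and stores
 * the xmin of the last snapshot held in *final_xmin.
 */
static int
scan_with_refresh(int npages, int reset_every, unsigned int *final_xmin)
{
    Snap snap = take_snapshot(0);
    int  page_read_counter = 1;     /* starts at 1, as in the patch */
    int  refreshes = 0;
    int  page;

    for (page = 0; page < npages; page++)
    {
        page_read_counter++;

        /* ... tuples on this page would be checked against snap here ... */

        if (page_read_counter % reset_every == 0)
        {
            /* drop the old snapshot and take a fresh one */
            snap = take_snapshot((unsigned int) (page + 1));
            refreshes++;
        }
    }

    *final_xmin = snap.xmin;
    return refreshes;
}
```

The point of the exercise is that the xmin the backend holds back is never older than the most recent replacement, instead of being pinned at the value from the start of the scan.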

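The validate_index() header comment above describes a "merge join" over the sorted TID lists of the auxiliary and target indexes to find tuples missing from the target. A minimal standalone sketch of that idea follows. It is not heapam_index_validate_tuplesort_difference() itself: encode_tid and sorted_difference are hypothetical helpers, TIDs are encoded as 64-bit integers (block number in the high bits, offset in the low 16 bits), and both inputs are assumed to be ascending and free of duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack (block, offset) so integer order equals TID order. */
static int64_t
encode_tid(uint32_t block, uint16_t offset)
{
    return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Copy into out[] every value present in aux[] but absent from target[].
 * Both inputs must be sorted ascending; out[] needs room for naux entries.
 * Returns the number of values copied.
 */
static size_t
sorted_difference(const int64_t *target, size_t ntarget,
                  const int64_t *aux, size_t naux,
                  int64_t *out)
{
    size_t i = 0;
    size_t j;
    size_t n = 0;

    for (j = 0; j < naux; j++)
    {
        /* advance the target cursor past values smaller than aux[j] */
        while (i < ntarget && target[i] < aux[j])
            i++;

        /* aux[j] is missing if the target is exhausted or holds a larger value */
        if (i >= ntarget || target[i] != aux[j])
            out[n++] = aux[j];
    }
    return n;
}
```

Each list is walked once, so the cost is linear in the combined length of the two inputs once they are sorted.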

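TransactionIdNewer() in the transam.h hunk above depends on TransactionIdFollows(), which compares xids on a circle rather than as plain integers. The standalone sketch below shows that comparison and the resulting "newer of two" helper. Xid, xid_follows and xid_newer are illustrative names and not the PostgreSQL definitions; the constants mirror the usual convention that 0 is the invalid xid and 3 is the first normal xid.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;               /* stand-in for TransactionId */

#define INVALID_XID       ((Xid) 0)
#define FIRST_NORMAL_XID  ((Xid) 3)

/* Circular comparison: does a logically follow b? */
static int
xid_follows(Xid a, Xid b)
{
    /* special (non-normal) ids compare as plain unsigned integers */
    if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
        return a > b;

    /* normal ids: the sign of the 32-bit difference decides */
    return (int32_t) (a - b) > 0;
}

/* Return the newer of two xids, treating an invalid xid as "no value". */
static Xid
xid_newer(Xid a, Xid b)
{
    if (a == INVALID_XID)
        return b;
    if (b == INVALID_XID)
        return a;
    return xid_follows(a, b) ? a : b;
}
```

With this comparison an xid just after wraparound (for example 10) counts as newer than one just before it (for example 4294967290), which is why a plain integer maximum would not be a safe way to track limitXmin.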

  [application/octet-stream] v17-0010-Remove-PROC_IN_SAFE_IC-optimization.patch (21.4K, 13-v17-0010-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From a7b2b723926d5fc14ddd650a5fb5b5ea235331c0 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v17 10/12] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 423424e51a2..93ad3f3f632 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index fe0b79a5fdd..4c602f74955 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2094,11 +2094,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8d755470e8c..00c86bfcfc6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1910,11 +1910,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4f9ccc9ca8d..be396368a09 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1671,10 +1660,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1729,9 +1714,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1761,10 +1743,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1790,9 +1768,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1809,9 +1785,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1852,10 +1825,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1876,10 +1845,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3619,7 +3584,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3991,17 +3955,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4061,7 +4014,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4154,11 +4106,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4189,10 +4136,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4201,11 +4144,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4230,10 +4168,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4253,11 +4187,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4278,10 +4207,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4314,10 +4239,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4345,9 +4266,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4369,13 +4287,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4431,12 +4342,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4498,12 +4403,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4763,36 +4662,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index f51b03d3822..de271f8ab37 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v17-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 14-v17-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From d66f96f6c8c93f27c2f42159388b03bdffa56e0f Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v17 11/12] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, because of an index
constraint violation), such indexes are left behind as junk indexes that serve
no purpose. This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)
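
As a reading aid for the dependency tracking listed in the commit message, the
sketch below models, outside PostgreSQL, the matching rule that
get_auxiliary_index() applies to dependency records referencing the main index.
Everything in it is a hypothetical stand-in (the DepRow struct, the DEP_AUTO and
KIND_INDEX constants, the find_auxiliary_index name); it is not the backend API,
only an illustration assuming the semantics described in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for catalog values; not PostgreSQL definitions. */
typedef unsigned int Oid;
#define INVALID_OID 0u
#define DEP_AUTO    'a'		/* models DEPENDENCY_AUTO */
#define DEP_NORMAL  'n'		/* models DEPENDENCY_NORMAL */
#define KIND_INDEX  'i'		/* models RELKIND_INDEX */
#define KIND_TABLE  'r'		/* models RELKIND_RELATION */

/* Simplified model of one dependency record: "objid depends on refobjid". */
typedef struct DepRow
{
	Oid			objid;		/* dependent object */
	char		objkind;	/* kind of the dependent object */
	Oid			refobjid;	/* referenced object */
	char		deptype;	/* dependency type */
} DepRow;

/*
 * Return the first index that has an AUTO dependency on main_index,
 * or INVALID_OID if no such record exists.
 */
Oid
find_auxiliary_index(const DepRow *rows, size_t nrows, Oid main_index)
{
	assert(rows != NULL || nrows == 0);

	for (size_t i = 0; i < nrows; i++)
	{
		if (rows[i].refobjid == main_index &&
			rows[i].deptype == DEP_AUTO &&
			rows[i].objkind == KIND_INDEX)
			return rows[i].objid;
	}
	return INVALID_OID;
}
```

In this model, a table or a NORMAL-dependency index that references the same
main index is ignored; only an index recorded with an AUTO dependency is
reported, which mirrors the assumption stated in the comment inside
get_auxiliary_index().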

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 64c633e0398..c6db5d57167 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bfd6a0a37f0..82f02df7430 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -688,6 +688,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -734,6 +736,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -776,6 +779,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1177,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1459,6 +1473,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1609,6 +1624,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3862,6 +3878,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3918,6 +3935,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; if found, it must be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4206,7 +4236,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4295,13 +4326,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4327,18 +4375,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index be396368a09..a67445dd6a2 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1248,7 +1248,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3582,6 +3582,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3930,6 +3931,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3937,6 +3939,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3999,12 +4002,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4014,6 +4022,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4034,10 +4043,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4194,7 +4211,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4213,6 +4231,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4395,6 +4416,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4440,6 +4463,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4397123398e..80d1605cf5a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1528,6 +1528,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1588,9 +1590,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1642,6 +1655,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must be the first action in its transaction.
+		 * If there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, silently perform an internal deletion of
+		 * the auxiliary index and restore the transaction state. Doing it in the
+		 * loop is fine because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1670,12 +1711,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 01f85e57ea2..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ca74844b5c6..aca6ec57ad7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 2cff1ac29be..e1464eaa67c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-04-30 20:01  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Mihail Nikalayeu @ 2025-04-30 20:01 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; Andres Freund <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Andres!

This is a gentle ping [1] about the patch set that optimizes
{CREATE INDEX, REINDEX} CONCURRENTLY. Below is an explanation of the
patch set.

QUICK INTRO
What the patch set does, in a few words: "CIC/RIC are 2x-3x faster and
do not prevent the xmin horizon from advancing, without any dirty
tricks, even while removing one of them".
How it works, in a few words: "Reset the snapshot between pages during
the first phase. Replace the second phase with a special auxiliary
index that collects the TIDs of tuples that need to be inserted into
the target index after the first phase".
What the drawbacks are: "some additional complexity, plus an additional
auxiliary index-like structure".

SOME HISTORY
In 2021 Álvaro proposed [2] and committed [3] a feature: VACUUM
ignores snapshots involved in concurrent indexing operations. This
was a great feature in PG14.
But in 2022 a bug that left tuples missing from indexes was detected,
and a little later explained by Andres [4]. As a result, the feature
was reverted [5], with Álvaro's comment [6]:

> Deciding to revert makes me sad, because this feature is extremely
> valuable for users.  However, I understand the danger and I don't
> disagree with the rationale so I can't really object.

There were some proposals, like introducing a special horizon for
HOT-pruning or stopping it during the CIC, but after some discussions
Andres said [7]:

> I'm also doubtful it's the right approach.
> The problem here comes from needing a snapshot for the entire duration of the validation scan
> ISTM that we should work on not needing that snapshot, rather than trying to reduce the consequences of holding that snapshot.
> I think it might be possible to avoid it.

So, given these two ideas I began the work on the patch.

STRUCTURE

The patch set itself contains 4 parts; some of them may be
reviewed/committed separately. All commit messages are detailed and
contain additional explanation of the changes.

To avoid confusing CFBot, commits are presented in the following
order: part 1, 2, 3 and 4. If you want to test/review only part 3,
check the files with the "patch_" extension. They differ a little bit,
but the changes are minor.

PART 1
Test set (does not depend on anything)

This is a set of stress tests and some fixes required for those tests
to succeed reliably (even on the current master branch). That part is
not required for any other part; its goal is to make sure everything
still works correctly while the other parts/commits are applied.
It includes:
- fixes related to races in case of ON CONFLICT UPDATE + REINDEX
CONCURRENTLY (the issue was discovered while testing this patch) [8]
- fixes in bt_index_parent_check (the issue was discovered while
testing this patch, with an enormous amount of pain: I spent months
looking for an error in the patch because of a single
bt_index_parent_check failure, but the issue was in
bt_index_parent_check itself) [9].

PART 2
Resetting snapshots during the first phase of CIC (does not depend on anything)

It is based on Matthias' idea [10]: just reset snapshots every so
often during a concurrent index build. It can work only during the
first scan (applied to the validation scan, such an approach would
miss some tuples).
The logic is simple: since the index built by the first scan already
misses a lot of tuples, we need not worry about missing a few more,
because the validation phase is going to fix it anyway. Of course it
is not so simple in the case of unique indexes, but it is still
possible.

Once committed: xmin advances at least during the first phase of a
concurrent index build.

Commits are:
- Reset snapshots periodically in non-unique non-parallel concurrent
index builds

Apply this technique to the simplest case: non-unique and
non-parallel. The snapshot is changed "between" pages.
One possible thing to worry about here: to ensure xmin advances we
need to call InvalidateCatalogSnapshot during each snapshot switch.
So, theoretically it may cause some issues, but the table is locked
against changes during the process. At least commit [3] (which ignored
the xmin of the CIC backend) effectively did the same thing.
Another, "cleaner" option here: we may just extract a separate
catalog snapshot horizon (one more field near xmin, used only for the
catalog snapshot); it seems to be a pretty straightforward change.
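
For illustration only, here is a toy model of why resetting the
snapshot between pages lets the horizon move. It is not code from the
patch: ToyXid, ToySnapshot and toy_scan_final_xmin are hypothetical
names, and "taking a new snapshot" is reduced to copying the oldest
running xid. It only shows that a backend holding one snapshot for the
whole scan keeps advertising its starting xmin, while a backend that
swaps snapshots after each page follows concurrent activity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Toy stand-in for a snapshot: only the xmin it pins matters here. */
typedef struct ToySnapshot
{
	ToyXid		xmin;
} ToySnapshot;

/*
 * Scan npages heap pages while other sessions consume xids_per_page
 * transaction IDs per page.  Returns the xmin this backend still
 * advertises when the scan finishes.
 */
ToyXid
toy_scan_final_xmin(int npages, ToyXid start_xid, ToyXid xids_per_page,
					bool reset_between_pages)
{
	ToyXid		oldest_running = start_xid; /* what a fresh snapshot sees */
	ToySnapshot snap = {start_xid};

	assert(npages >= 0);

	for (int page = 0; page < npages; page++)
	{
		/* ... index the tuples on this page that are visible to snap ... */

		oldest_running += xids_per_page;	/* concurrent activity */

		if (reset_between_pages)
			snap.xmin = oldest_running; /* drop old snapshot, take a new one */
	}

	return snap.xmin;
}
```

With 1000 pages, a starting xid of 100 and 5 xids consumed per page,
the model keeps xmin at 100 without resets and ends at 5100 with them.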

- Support snapshot resets in parallel concurrent index builds

Extend that technique to parallel builds. It is mostly about ensuring
that workers have restored their initial snapshot from the leader
before the leader resets it.

- Support snapshot resets in concurrent builds of unique indexes

The trickiest commit in the second part: apply the technique to unique
indexes. Changing snapshots may cause issues with validation of
unique constraints. Currently validation is done during the sorting of
tuples, but that doesn't work with tuples read under different
snapshots (some of them are dead already). To deal with it:
- if we see two identical tuples during tuplesort, ignore the case if
any of them is dead according to SnapshotSelf, but fail if both are
alive. This is not a required part; it is just a mechanism for
fail-fast behavior and may be removed.
- to provide the guarantee, during _bt_load compare each inserted
index value with the previously inserted one. If they are equal, make
sure only a single tuple that is alive according to SnapshotSelf
exists in the whole "group" of equal values (in general it may include
more than two tuples).
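
The "group" rule above can be sketched in a few lines. This is a
simplified illustration, not the _bt_load change itself: ToyIndexTuple
and toy_unique_ok are hypothetical names, keys are plain integers, and
the "alive" flag stands in for the result of a SnapshotSelf visibility
check that the real code would perform against the heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyIndexTuple
{
	int			key;			/* indexed value */
	bool		alive;			/* assumed result of a SnapshotSelf check */
} ToyIndexTuple;

/*
 * tuples[] must be sorted by key, as it would be when loading a btree.
 * Returns true if no group of equal keys contains more than one alive
 * tuple, i.e. the unique constraint holds.
 */
bool
toy_unique_ok(const ToyIndexTuple *tuples, size_t n)
{
	size_t		alive_in_group = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (i > 0)
			assert(tuples[i - 1].key <= tuples[i].key);

		/* A different key than the previous tuple starts a new group. */
		if (i == 0 || tuples[i].key != tuples[i - 1].key)
			alive_in_group = 0;

		if (tuples[i].alive && ++alive_in_group > 1)
			return false;		/* two alive tuples share one key */
	}

	return true;
}
```

A group such as (1, dead), (1, alive), (1, dead) passes, while
(1, alive), (1, dead), (1, alive) fails even though the two alive
tuples are not adjacent, which is why the whole group has to be
considered rather than only neighbouring pairs.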

Theoretically it may affect the performance of _bt_load because of the
_bt_keep_natts(_fast) call for each tuple, but I was unable to notice
any significant difference here. In any case, it is compensated by
Part 3 for sure.

PART 3
STIR-based validation phase of CIC (does not depend on anything)

This part replaces the second phase of CIC with a more efficient one
(with the ability to let the horizon advance as an additional bonus).

The role of the second phase is to find tuples which are not present
in the index built by the first scan, because:
- some of them were too new for the snapshot used during the first phase
- even if we were to use SnapshotSelf to accept all alive tuples –
some of them may be inserted in pages already visited by the scan

The main idea is:
- before starting the first scan, let's prepare a special auxiliary
super-lightweight index (it is not even an index or access method, it
just pretends to be one) with the same columns, expressions and
predicates
- that access method (Short Term Index Replacement, STIR) just appends
the TIDs of newly arriving tuples: no WAL, minimal locking, the
simplest append-only structure, no actual indexed data
- it remembers all new TIDs inserted into the table during the first
phase
- once our main (target) index starts receiving updates itself, we may
safely clear the "ready" flag on the STIR index
- if our first-phase scan missed something, it is guaranteed to be
present in that STIR index
- so, instead of having to compare the whole table to the index, we
only need to compare the index to the TIDs stored in the STIR index
- as a bonus, we may reset snapshots during the comparison without
risk of any issues caused by HOT pruning (the issue [4] that caused
the revert of [3]).
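
The comparison step in the list above boils down to a set difference
over TIDs. The following is a minimal sketch under loud assumptions:
ToyTid, cmp_tid and toy_validate are hypothetical names, TIDs are
flattened to 64-bit integers, and heap visibility checks, which the
real validation phase still needs, are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t ToyTid;		/* toy TID, e.g. (block << 16) | offset */

static int
cmp_tid(const void *a, const void *b)
{
	ToyTid		x = *(const ToyTid *) a;
	ToyTid		y = *(const ToyTid *) b;

	return (x > y) - (x < y);
}

/*
 * Sorts both arrays in place, then writes to out[] every TID recorded in
 * stir[] that is absent from indexed[].  Those are the tuples the first
 * phase missed and that still have to be inserted into the target index.
 * out[] must have room for nstir entries.  Returns the number written.
 */
size_t
toy_validate(ToyTid *stir, size_t nstir,
			 ToyTid *indexed, size_t nindexed,
			 ToyTid *out)
{
	size_t		j = 0;
	size_t		nout = 0;

	assert(out != NULL || nstir == 0);

	qsort(stir, nstir, sizeof(ToyTid), cmp_tid);
	qsort(indexed, nindexed, sizeof(ToyTid), cmp_tid);

	for (size_t i = 0; i < nstir; i++)
	{
		/* The append-only list may hold the same TID more than once. */
		if (i > 0 && stir[i] == stir[i - 1])
			continue;

		/* Advance through the index TIDs up to the current STIR TID. */
		while (j < nindexed && indexed[j] < stir[i])
			j++;

		if (j == nindexed || indexed[j] != stir[i])
			out[nout++] = stir[i];
	}

	return nout;
}
```

The point of the sketch is the cost model: the work is proportional to
the number of TIDs collected during the first phase plus the index
size, rather than to a second full scan of the table.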

That approach provides a significant performance boost in terms of
time required to build the index. STIR itself theoretically causes
some performance impact, but I was not able to detect it. Also, some
optimizations are applied to it (see below). Details of benchmarks are
presented below as well.

Commits are:
- Add STIR access method and flags related to auxiliary indexes

This one adds STIR code and some flags to distinguish real and
auxiliary indexes.

- Add Datum storage support to tuplestore

Adds the ability to store a Datum in a tuplestore. The following
commits use it to benefit from prefetching of pages during the
validation phase.

- Use auxiliary indexes for concurrent index operations

The main part is here. It contains all the logic for creation of
auxiliary index, managing its lifecycle, new validation phase and so
on (including progress reporting, some documentation updates, ability
to have unlogged index for logged table, etc). At the same time it
still relies on a single referenced snapshot during the validation
phase.

- Track and drop auxiliary indexes in DROP/REINDEX

This commit adds several techniques so that no extra administration
is needed to deal with auxiliary indexes left behind after an error
during the index build (junk auxiliary indexes). It adds dependency
tracking, special logic for handling REINDEX calls and other small
things to make the administrator's life a little bit easier.

- Optimize auxiliary index handling

Since the STIR index does not contain any actual data, we may skip
preparing that data during tuple insertion. This commit implements
that optimization.

- Refresh snapshot periodically during index validation

Adds logic to the new validation phase to reset the snapshot every so
often. Currently it does so every 4096 pages visited.

PART 4 (depends on part 2 and part 3)

Commits are:

- Remove PROC_IN_SAFE_IC optimization

This is a small part which makes sense only if both parts 2 and 3 are
applied. Once they are, CIC no longer prevents the horizon from
advancing regularly.
That makes the PROC_IN_SAFE_IC optimization [11] obsolete, because one
CIC now has no issue waiting for the xmin of another (because it
advances regularly).

BENCHMARKS

I have spent a lot of time benchmarking the patch in different
environments (local SSD, local SSD with 1ms delay, io2 AWS) and the
results look impressive.
I can't measure any performance (or significant space usage)
degradation caused by the presence of the STIR index, while the new
validation phase makes the build up to 3x-4x faster. And there are no
VACUUM-related issues during that time (so other operations on the
database benefit, which easily compensates for the additional
STIR-related cost).

A description of the benchmarks is available here [12].

Some results are here: [13] and here [14], code is here [15].

There is also a Discord thread here [16].

Feel free to ask any question and request benchmarks for some scenarios.

Best regards,
Mikhail.

[1]: https://discord.com/channels/1258108670710124574/1334565506149253150/1339368558408372264
[2]: https://www.postgresql.org/message-id/flat/20210115142926.GA19300%40alvherre.pgsql#0988173cb0cf4b8eb...
[3]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d
[4]: https://www.postgresql.org/message-id/flat/20220524190133.j6ee7zh4f5edt5je%40alap3.anarazel.de#17814...
[5]: https://github.com/postgres/postgres/commit/e28bb885196916b0a3d898ae4f2be0e38108d81b
[6]: https://www.postgresql.org/message-id/flat/202205251643.2py5jjpaw7wy%40alvherre.pgsql#589508d30b480b...
[7]: https://www.postgresql.org/message-id/flat/20220525170821.rf6r4dnbbu4baujp%40alap3.anarazel.de#accf6...
[8]: https://commitfest.postgresql.org/patch/5160/
[9]: https://commitfest.postgresql.org/patch/5438/
[10]: https://www.postgresql.org/message-id/flat/CAEze2WgW6pj48xJhG_YLUE1QS%2Bn9Yv0AZQwaWeb-r%2BX%3DHAxU_g...
[11]: https://github.com/postgres/postgres/commit/c98763bf51bf610b3ee7e209fc76c3ff9a6b3163
[12]: https://www.postgresql.org/message-id/flat/CANtu0ojHAputNCH73TEYN_RUtjLGYsEyW1aSXmsXyvwf%3D3U4qQ%40m...
[13]: https://www.postgresql.org/message-id/flat/CANtu0ojiVez054rKvwZzKNhneS2R69UXLnw8N9EdwQwqfEoFdQ%40mai...
[14]: https://docs.google.com/spreadsheets/d/1UYaqpsWSfYdZdQxaqY4gVo0RW6KrT0d-U1VDNJB8lVk/edit?usp=sharing
[15]: https://gist.github.com/michail-nikolaev/b33fb0ac1f35729388c89f72db234b0f
[16]: https://discord.com/channels/1258108670710124574/1259884843165155471/1334565506149253150


Attachments:

  [image/png] bench.png (44.9K, 2-bench.png)

  [application/x-patch] v18-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 3-v18-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From f75ff526ffcc1c270814d6f5e80c35991c6b5ab4 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v18 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main limitation of applying that technic to parallel builds was a requirement to wait until workers processes restore their initial snapshot from leader.

To address this, following changes applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with scan
- relax limitation for parallel worker to call GetLatestSnapshot
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e5a945a1b14..423424e51a2 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In case of concurrent build snapshots are going to be reset periodically.
+	 * We need to wait until all workers imported initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 90d73b9f712..274c38d12bb 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1778,14 +1778,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1808,6 +1808,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2167,6 +2168,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8a584db595a..7273b1aee00 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f3986d086b6..2f45ae96c0c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1420,6 +1417,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1437,12 +1435,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1450,6 +1457,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1510,7 +1522,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1537,7 +1549,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1613,6 +1626,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1621,7 +1641,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1645,7 +1666,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1895,6 +1916,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1949,11 +1971,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1989,4 +2015,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..277c79dd554 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that
+	 * use periodic snapshot resets, mark the scan accordingly and use the active
+	 * snapshot as the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report whether
+		 * a worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cbd0ba9aa01..6432ef55cdc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ed35c58c2c3..8a15dd72b91 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index dcfe16a9824..580ac54856f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -342,14 +342,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7e8fa5e1b57..387c308ec2f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan.  That makes the scan change snapshots on the fly, allowing the
+ * xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
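
For readers following the WaitForParallelWorkersToAttach() change in the patch above, here is a minimal standalone sketch of the attach rule it introduces. It is not PostgreSQL code: ToyWorker and toy_count_known_attached() are hypothetical names used only for illustration, and the two booleans stand in for "the error queue has a sender" and the worker's snapshot_restored flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-worker state the leader inspects. */
typedef struct ToyWorker
{
    bool        sender_attached;    /* worker attached to its error queue */
    bool        snapshot_restored;  /* worker has restored its snapshot */
} ToyWorker;

/*
 * Count the workers the leader may treat as attached.  With
 * wait_for_snapshot set, a worker that has attached but not yet restored
 * its snapshot is not counted, so the leader keeps waiting for it.
 */
static int
toy_count_known_attached(const ToyWorker *workers, size_t nworkers,
                         bool wait_for_snapshot)
{
    int         known = 0;

    for (size_t i = 0; i < nworkers; i++)
    {
        if (!workers[i].sender_attached)
            continue;
        if (!wait_for_snapshot || workers[i].snapshot_restored)
            known++;
    }
    return known;
}
```

With three workers in the states {attached, restored}, {attached, not restored} and {not attached}, the count is 2 without the flag and 1 with it. The sketch only models the counting rule; the real function additionally handles worker startup failures and waits on a latch.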



  [application/x-patch] v18-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.1K, 4-v18-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From 35abe1014921abe20bd266b5507df2e4e8e685ff Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v18 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to mitigate this problem by allowing VACUUM to ignore such snapshots. However, it was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces an alternative by periodically resetting the snapshot used during the first phase. By resetting the snapshot every N pages during the heap scan, it allows the xmin horizon to advance.

Currently, this technique is applied only to:

- the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in following commits

A new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset "between" every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
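
As a rough illustration of the reset condition described above, the following standalone sketch mirrors the check added to heap_fetch_next_buffer() in this patch. It is not the patch code itself: the helper names should_reset_snapshot() and count_snapshot_resets() are hypothetical, and the bit value chosen for SO_RESET_SNAPSHOT is an assumption made only so the example compiles on its own. The interval 4096 matches the SO_RESET_SNAPSHOT_EACH_N_PAGE definition in heapam.h below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag bit, for illustration only; the real value lives in tableam.h. */
#define SO_RESET_SNAPSHOT              (1u << 10)
/* Same interval as the definition added to heapam.h by this patch. */
#define SO_RESET_SNAPSHOT_EACH_N_PAGE  4096u

/*
 * Hypothetical helper: true when a scan with the given flags should swap
 * its snapshot upon reading block "blockno".
 */
static inline bool
should_reset_snapshot(uint32_t scan_flags, uint32_t blockno)
{
    return (scan_flags & SO_RESET_SNAPSHOT) != 0 &&
           (blockno % SO_RESET_SNAPSHOT_EACH_N_PAGE) == 0;
}

/*
 * Hypothetical helper: number of resets for a sequential scan over blocks
 * 0 .. nblocks-1.  Block 0 also satisfies the modulo test.
 */
static inline uint32_t
count_snapshot_resets(uint32_t scan_flags, uint32_t nblocks)
{
    uint32_t resets = 0;

    /* Guard against a zero interval, which would make the modulo undefined. */
    assert(SO_RESET_SNAPSHOT_EACH_N_PAGE > 0);

    for (uint32_t blk = 0; blk < nblocks; blk++)
    {
        if (should_reset_snapshot(scan_flags, blk))
            resets++;
    }
    return resets;
}
```

Under these assumptions, a scan that starts at block 0 and covers 10000 blocks would reset at blocks 0, 4096 and 8192. A real scan may start elsewhere (for example with syncscan), so the actual reset points depend on the block numbers visited.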
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3048e044aec..e59197bb35e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 0d9c2b0b653..a6dad54ff58 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f856..e5a945a1b14 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a7b7b5996e3..90d73b9f712 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c1a4de14a59..4e99b6f44c5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -612,6 +613,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -653,7 +684,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1304,6 +1340,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..8a584db595a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed because the snapshot is replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a reference to the active snapshot because it may be a copy */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 8f532e14590..42921020316 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924ad..f3986d086b6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1409,6 +1419,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1434,9 +1445,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1490,6 +1508,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1584,6 +1604,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1600,6 +1622,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..cbd0ba9aa01 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
+ * to see those tuples are gone.  One of reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index bb0155fdc24..d687646efed 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,23 +1694,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible to a single snapshot
+	 * or to multiple periodically refreshed snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4073,9 +4067,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4090,7 +4081,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index beafac8c0b0..e690d6c6cec 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6896,6 +6897,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6951,6 +6953,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7008,6 +7015,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..6caad42ea4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..7e8fa5e1b57 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a new snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot must stay unregistered so that xmin can advance. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index built concurrently without parallel workers,
+ * SO_RESET_SNAPSHOT is applied to the scan. Snapshots are then replaced
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/x-patch] v18-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (25.3K, 5-v18-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From 723860e7d96521efb36a5003d9d787be59ce374a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v18 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for the stability of stress tests.

---
 contrib/amcheck/meson.build                   |   1 +
 .../t/006_cic_bt_index_parent_check.pl        |  39 +++++
 contrib/amcheck/verify_nbtree.c               |  68 ++++-----
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 +++++++++++++--
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++++++++-----
 src/backend/utils/time/snapmgr.c              |   2 +
 9 files changed, 285 insertions(+), 88 deletions(-)
 create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index b33e8c9b062..b040000dd55 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -49,6 +49,7 @@ tests += {
       't/003_cic_2pc.pl',
       't/004_verify_nbtree_unique.pl',
       't/005_pitr.pl',
+      't/006_cic_bt_index_parent_check.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..6e52c5e39ec
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,39 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into the table
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 33c2106c17c..bb0155fdc24 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1790,6 +1790,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4195,7 +4196,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4274,6 +4275,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..6b2e462b70b 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 3f8a4cb5244..f1757d02f1c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -487,6 +487,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -697,6 +739,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -707,23 +751,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 46d533b7288..b2fa3b95855 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1178,6 +1179,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 59233b64730..0c720e450e9 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -716,12 +716,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -756,8 +758,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -769,30 +771,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare the requirements that other indexes must meet to be used as
+					 * arbiters together with indexOidFromConstraint. Both equivalent indexes
+					 * must be involved in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -815,7 +863,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -835,27 +889,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -875,7 +925,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -883,6 +933,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -920,27 +974,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure that the candidate has the
+		 * same set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -948,7 +1010,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..dcfe16a9824 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -447,6 +448,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0



  [application/x-patch] v18-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.4K, 6-v18-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From b47afd37eb728601a4975b749a3c5ce4d2f39081 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v18 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now snapshots are reset periodically during concurrent unique index builds, while uniqueness is still maintained by:
- ignoring SnapshotSelf-dead tuples during uniqueness checks in tuplesort; this is a fail-fast mechanism, not a guarantee
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values; this is the guarantee of correctness

Tuples are SnapshotSelf-tested only in the case of equal index key values; otherwise _bt_load works as before.
---
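Note for reviewers (illustration only, not part of the patch): the _bt_load
check described above can be sketched outside PostgreSQL as a single pass over
a key-sorted stream. All names here (SortedEntry, is_alive,
has_alive_duplicate) are hypothetical. The is_alive flag stands in for the
SnapshotSelf heap fetch, and the real code works on IndexTuples and raises
ERRCODE_UNIQUE_VIOLATION instead of returning a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sorted index tuple. */
typedef struct SortedEntry
{
	int			key;		/* index key value; input must be sorted by key */
	bool		is_alive;	/* stand-in for the SnapshotSelf visibility test */
} SortedEntry;

/*
 * Return true if any run of equal keys contains two or more alive entries.
 * Liveness is consulted only when a key equals its predecessor, so unique
 * keys never pay for a visibility test.
 */
bool
has_alive_duplicate(const SortedEntry *entries, size_t n)
{
	bool		seen_alive = false;
	bool		prev_tested = false;

	assert(entries != NULL || n == 0);

	for (size_t i = 1; i < n; i++)
	{
		if (entries[i].key != entries[i - 1].key)
		{
			/* New key group: forget what we learned about the old one. */
			seen_alive = false;
			prev_tested = false;
			continue;
		}

		/* Lazily test the first member of the group. */
		if (!prev_tested)
		{
			seen_alive = entries[i - 1].is_alive;
			prev_tested = true;
		}

		/* Two alive entries under one key: uniqueness is violated. */
		if (seen_alive && entries[i].is_alive)
			return true;

		seen_alive |= entries[i].is_alive;
	}
	return false;
}
```

For the "addda" layout (alive, dead, dead, dead, alive under one key) this
returns true, even though no two adjacent entries are both alive. That is the
case a check that only compares neighbouring pairs may miss.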
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 264 insertions(+), 94 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7273b1aee00..0eaa4df5582 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 08884116aec..347b50d6e51 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2f45ae96c0c..d186ce9ec37 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples in uniqueness checks in the case of a concurrent build.
+	 * This is required because of the periodic snapshot resets.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool contains no
+	 * multiple alive tuples with the same values. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* test the previous tuple if not done yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1320,7 +1432,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1417,7 +1529,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,21 +1546,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1457,16 +1559,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1536,6 +1638,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1550,7 +1653,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1630,7 +1733,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1641,7 +1744,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1744,6 +1847,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1847,11 +1951,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1931,6 +2036,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1953,14 +2059,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 8b025796127..abe2d8d7f97 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -66,8 +66,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2515,7 +2513,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3805,7 +3803,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3923,17 +3921,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so it can be used as the comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, it is set to true if any examined column is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -3959,6 +3964,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -3978,7 +3985,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -3989,7 +3996,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -3998,6 +4006,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, it is set to true if any examined column is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4006,7 +4016,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4023,6 +4034,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6432ef55cdc..cca1dbb8e37 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often.  The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d687646efed..778d9528c25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,8 +1694,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 5f70e8dddac..71a5c21e0df 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -358,6 +360,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -400,6 +403,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1653,6 +1657,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1662,18 +1667,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..38471e90a0c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1339,8 +1339,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 387c308ec2f..5182013aabd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan, so
+ * snapshots are changed on the fly, allowing the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/x-patch] v18-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 7-v18-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 854d5f9545d9ed5cd85c2e3dfd1c817345967be2 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v18 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- needed because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/x-patch] v18-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.9K, 8-v18-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From ad69a739c72782d740fe6e96c45e4006f859b915 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v18 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- STIR (Short-Term Index Replacement) access method: intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index a6dad54ff58..ca5214461e6 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad09..232c87ec267 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3092,6 +3092,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3143,6 +3144,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * But we do it anyway, for consistency with other access methods.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		if (metaData->skipInserts)	/* inserts are disabled: clean up and exit */
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else added a new page to the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes are not supported to be built")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark the index to skip inserts and warn that it should be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at this moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cca1dbb8e37..e9e22ec0e84 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..38602e6a72d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index dfbb4c85460..a121b4d31c9 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62beb71da28..f05a5eecdda 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5b6cadb5a6c..3850dde4adb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -182,12 +182,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,6 +217,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index cf48ae6d0c2..52dde57680d 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5137,7 +5137,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5151,7 +5152,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5176,9 +5178,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5187,12 +5189,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5201,7 +5204,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/x-patch] v18-0007-Add-Datum-storage-support-to-tuplestore.patch (17.3K, 9-v18-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From f3d21defc328c4118c8468ade82193c2be62eb37 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v18 07/12] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
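
As a rough illustration of the layout described above (not the patch's code), the sketch below encodes a value into a plain byte buffer. DatumLayout, encode_datum and decode_datum are hypothetical names, the length word is assumed to be a 4-byte integer, and the padding mentioned for variable-length types is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the store's per-type settings; typlen follows
 * the PostgreSQL convention (> 0 means fixed length, < 0 means variable).
 */
typedef struct DatumLayout
{
	int			typlen;
	int			backward;		/* also write a trailing length word? */
} DatumLayout;

/* Append one value to buf; returns the number of bytes written. */
static size_t
encode_datum(const DatumLayout *layout, const void *data, uint32_t datalen,
			 unsigned char *buf)
{
	size_t		off = 0;

	if (layout->typlen > 0)
	{
		/* Fixed length: raw bytes only, the reader already knows the size. */
		assert(datalen == (uint32_t) layout->typlen);
		memcpy(buf, data, datalen);
		return datalen;
	}

	/* Variable length: leading length word, then the payload. */
	memcpy(buf + off, &datalen, sizeof(datalen));
	off += sizeof(datalen);
	memcpy(buf + off, data, datalen);
	off += datalen;

	/* A trailing length word is only needed for backward scans. */
	if (layout->backward)
	{
		memcpy(buf + off, &datalen, sizeof(datalen));
		off += sizeof(datalen);
	}
	return off;
}

/* Locate the value stored at buf; returns a pointer to its payload. */
static const unsigned char *
decode_datum(const DatumLayout *layout, const unsigned char *buf,
			 uint32_t *datalen)
{
	if (layout->typlen > 0)
	{
		*datalen = (uint32_t) layout->typlen;
		return buf;
	}
	memcpy(datalen, buf, sizeof(*datalen));
	return buf + sizeof(*datalen);
}
```

With this layout a 6-byte TID costs exactly 6 bytes on tape, while a 3-byte variable-length value costs 3 bytes plus one or two length words.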
---
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..12ae705c091 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape. Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, the first "unsigned int" of a stored tuple is
+ * the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot(), but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/x-patch] v18-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (96.7K, 10-v18-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 882ee505e2943c6d6f5bfa7be1f413bd6605d5ef Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v18 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second table full scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so that the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
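
The two-pointer step of the merge can be pictured with the sketch below. It is only an illustration under simplifying assumptions: plain sorted arrays stand in for the two tuplesorts, and Tid, tid_cmp and tids_missing_from_main are hypothetical names rather than anything in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified TID: (block number, offset within block). */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

static int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Both inputs must be sorted ascending.  Writes every TID found in aux but
 * not in main_tids to out (which needs room for naux entries) and returns
 * how many were written.
 */
static size_t
tids_missing_from_main(const Tid *aux, size_t naux,
					   const Tid *main_tids, size_t nmain,
					   Tid *out)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		nout = 0;

	while (i < naux)
	{
		/* The scan relies on sorted input; check it in debug builds. */
		assert(i == 0 || tid_cmp(&aux[i - 1], &aux[i]) <= 0);

		/* Skip main-index TIDs that sort before the current aux TID. */
		while (j < nmain && tid_cmp(&main_tids[j], &aux[i]) < 0)
			j++;

		/* Absent from main: remember it for the later heap fetch. */
		if (j >= nmain || tid_cmp(&main_tids[j], &aux[i]) != 0)
			out[nout++] = aux[i];
		i++;
	}
	return nout;
}
```

Because both inputs are consumed in one forward pass, the output is itself in TID order, which is what lets the later heap fetch visit pages sequentially.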
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 292 +++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/catalog/toasting.c             |   3 +-
 src/backend/commands/indexcmds.c           | 337 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  28 +-
 src/include/catalog/index.h                |  12 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/execnodes.h              |   4 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1104 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d768ea065c5..65cd1de5295 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6305,6 +6305,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that the auxiliary index is used for new
+       tuples in future transactions.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6345,13 +6357,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6368,8 +6379,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 147a8f7587c..e7a7a160742 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..57c347f2930 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any that are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0eaa4df5582..633bc245e28 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate the set difference (relative complement) of the aux and main
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * the main one are added to the store.
+ *
+ * In set theory notation: store = aux - main, or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If no block has been seen yet (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't seen any block number yet, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e9e22ec0e84..6c09c6a2b67 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -744,7 +749,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -755,11 +761,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +798,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1408,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1463,7 +1474,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1473,6 +1485,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2630,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3469,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation without any
+ * actual table scan; as a result, we have an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3493,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot.  Any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3515,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" over
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3540,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Afterwards, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs
+	 * and 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for a tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3592,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3617,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3663,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan() */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3695,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3756,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4032,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4281,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4307,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
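As a rough illustration of the merge step described in the validate_index() header comment in the index.c diff above, here is a minimal standalone sketch. It is not code from the patch. It assumes TIDs are packed into 64-bit integers (block number in the high bits, offset in the low 16 bits) and that both TID arrays are already sorted, which the patch achieves with tuplesorts. The names `encode_tid` and `aux_missing_from_target` are hypothetical. The heap fetch and snapshot check that the real code must perform before inserting a tuple are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack (block, offset) into one sortable 64-bit key. */
static int64_t
encode_tid(uint32_t block, uint16_t offset)
{
	return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Hypothetical sketch of the "merge join" over two sorted TID lists.
 * Writes every key that is present in aux[] but absent from target[] into
 * out[] (which must have room for naux entries) and returns how many were
 * written.  Duplicate keys in aux[] are reported only once.
 */
static size_t
aux_missing_from_target(const int64_t *aux, size_t naux,
						const int64_t *target, size_t ntarget,
						int64_t *out)
{
	size_t		i;
	size_t		j = 0;
	size_t		nout = 0;

	assert(out != NULL || naux == 0);

	for (i = 0; i < naux; i++)
	{
		/* Skip target entries that sort before the current aux entry. */
		while (j < ntarget && target[j] < aux[i])
			j++;

		/* No equal target entry: this TID is missing from the target. */
		if (j == ntarget || target[j] != aux[i])
		{
			if (nout == 0 || out[nout - 1] != aux[i])
				out[nout++] = aux[i];
		}
	}
	return nout;
}
```

Because both inputs are consumed in a single forward pass, the cost is linear in the number of TIDs once the sorts are done, which is the property the "merge join" wording in the comment relies on.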
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 15efb02badb..edd61c294a6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1288,16 +1288,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
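The renumbering in the view above can be summarised as a lookup table. The sketch below is only an illustration, assuming the numeric values shown in the CASE expression. The array and function names are hypothetical and are not the PROGRESS_CREATEIDX_PHASE_* macros from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Phase descriptions, indexed by the value reported in S.param10. */
static const char *const create_index_phase_names[] = {
	"initializing",									/* 0 */
	"waiting for writers before build",				/* 1 */
	"waiting for writers to use auxiliary index",	/* 2, new in this patch */
	"building index",			/* 3, the view appends an AM-specific suffix */
	"waiting for writers before validation",		/* 4 */
	"index validation: scanning index",				/* 5 */
	"index validation: sorting tuples",				/* 6 */
	"index validation: merging indexes",			/* 7, was "scanning table" */
	"waiting for old snapshots",					/* 8 */
	"waiting for readers before marking dead",		/* 9 */
	"waiting for readers before dropping",			/* 10 */
};

/* Hypothetical helper: returns NULL for an unknown phase, as the CASE does. */
static const char *
create_index_phase_name(long phase)
{
	const size_t n = sizeof(create_index_phase_names) /
		sizeof(create_index_phase_names[0]);

	assert(n == 11);			/* phases 0..10 after the renumbering */
	if (phase < 0 || (size_t) phase >= n)
		return NULL;
	return create_index_phase_names[phase];
}
```

Any monitoring tool that compares the numeric phase rather than the text would need the same shift by one for every phase from "building index" onwards.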
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
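The life cycle of the auxiliary index, as described in the index.c header comment earlier in this patch and implemented in the indexcmds.c diff that follows, can be pictured as a small state machine. This is a simplified sketch with hypothetical names. In the patch, the transitions are driven by index_concurrently_build(), index_set_state_flags() with INDEX_DROP_CLEAR_READY, index_concurrently_set_dead() and performDeletion(), separated by transaction commits and waits for concurrent transactions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the stages an auxiliary (STIR) index goes through. */
typedef enum AuxIndexStage
{
	AUX_CREATED,				/* catalog entry only: not ready, not valid */
	AUX_READY,					/* built empty; writers insert new TIDs */
	AUX_NOT_READY,				/* target is ready; aux no longer updated */
	AUX_DEAD,					/* marked dead; transactions ignore it */
	AUX_DROPPED					/* removed from the catalog */
} AuxIndexStage;

/* Advance to the next stage; stages are only ever visited in this order. */
static AuxIndexStage
aux_index_next_stage(AuxIndexStage stage)
{
	assert(stage != AUX_DROPPED);
	return (AuxIndexStage) (stage + 1);
}

/* Writers insert into the auxiliary index only while it is marked ready. */
static bool
aux_index_receives_inserts(AuxIndexStage stage)
{
	return stage == AUX_READY;
}
```

The window in which `aux_index_receives_inserts()` is true is meant to cover the whole build of the target index, so that tuples the build scan did not see are still captured by TID.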
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 778d9528c25..388c3f92dae 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,6 +588,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -833,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1251,7 +1267,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1593,6 +1610,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, also create the auxiliary index catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1621,11 +1648,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1635,7 +1662,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1674,7 +1701,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1686,14 +1713,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index.  It is created empty, without any actual
+	 * heap scan, but is marked as "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this point we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1722,9 +1773,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1742,24 +1812,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the auxiliary and target indexes, inserting any missing entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1786,7 +1846,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1811,6 +1871,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait until all transactions see the auxiliary index as not ready */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait until all transactions ignore the auxiliary index as dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3531,6 +3638,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3636,8 +3744,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3689,8 +3804,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3751,6 +3873,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3854,15 +3983,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3913,6 +4045,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3926,12 +4063,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3940,6 +4082,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3958,10 +4101,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4042,13 +4189,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast, since it is created empty without a heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4091,6 +4281,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4098,12 +4323,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4141,7 +4360,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4170,7 +4389,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4260,14 +4479,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4292,6 +4511,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4305,11 +4546,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4329,6 +4570,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5182013aabd..abcb147d9b3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,12 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	void 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												Snapshot snapshot,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1820,19 +1821,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
 						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	table_rel->rd_tableam->index_validate_scan(table_rel,
+											   index_rel,
+											   index_info,
+											   snapshot,
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..4713f18e68d 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3850dde4adb..76f25ec686f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -187,8 +187,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..ca74844b5c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..ed6c20a495c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,14 +2041,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e21ff426519..2cff1ac29be 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
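
The renumbering in progress.h and the matching CASE expression in rules.out above can be summarised as a lookup table. What follows is only a minimal sketch for readers of the thread, not code from the patch: the names createidx_phase_names, createidx_phase_name and createidx_phase_demo are hypothetical, and the label for phase 3 leaves out the AM-specific subphase suffix that the real view appends.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Phase labels indexed by phase number, copied from the CASE expression in
 * the patched rules.out.  Phase 3 is shown without the AM-specific suffix.
 */
static const char *const createidx_phase_names[] = {
	"initializing",									/* 0 */
	"waiting for writers before build",				/* 1: WAIT_1 */
	"waiting for writers to use auxiliary index",	/* 2: WAIT_2 */
	"building index",								/* 3: BUILD */
	"waiting for writers before validation",		/* 4: WAIT_3 */
	"index validation: scanning index",				/* 5: VALIDATE_IDXSCAN */
	"index validation: sorting tuples",				/* 6: VALIDATE_SORT */
	"index validation: merging indexes",			/* 7: VALIDATE_IDXMERGE */
	"waiting for old snapshots",					/* 8: WAIT_4 */
	"waiting for readers before marking dead",		/* 9: WAIT_5 */
	"waiting for readers before dropping",			/* 10: WAIT_6 */
};

#define CREATEIDX_NUM_PHASES \
	(sizeof(createidx_phase_names) / sizeof(createidx_phase_names[0]))

/* Hypothetical helper: return the label for a phase, or NULL if unknown. */
const char *
createidx_phase_name(int phase)
{
	if (phase < 0 || (size_t) phase >= CREATEIDX_NUM_PHASES)
		return NULL;
	return createidx_phase_names[phase];
}

/* Usage example: phases that are new or renumbered by the patch. */
void
createidx_phase_demo(void)
{
	assert(strcmp(createidx_phase_name(2),
				  "waiting for writers to use auxiliary index") == 0);
	assert(strcmp(createidx_phase_name(7),
				  "index validation: merging indexes") == 0);
	assert(createidx_phase_name(11) == NULL);
}
```

Returning NULL for an unknown number mirrors the ELSE NULL branch of the view. Any monitoring tool that hard-codes the old numbers (for example, 2 for "building index") would need a similar update.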



  [application/x-patch] v18-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (28.7K, 11-v18-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From f80157f91d51f379bfdc7bb91ad9c84fbff39ee6 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v18 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)
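
To make the dependency tracking described in the commit message easier to follow, here is a standalone model of the lookup that get_auxiliary_index() performs further down in this patch. It is a sketch under stated assumptions, not backend code: pg_depend is replaced by an in-memory array, the objrelkind field stands in for the get_rel_relkind() call, and DependRow, sample_rows and model_get_auxiliary_index are hypothetical names. The constant values match PostgreSQL's own definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

#define InvalidOid			((Oid) 0)
#define RelationRelationId	((Oid) 1259)	/* pg_class */
#define DEPENDENCY_NORMAL	'n'
#define DEPENDENCY_AUTO		'a'
#define RELKIND_INDEX		'i'

/* Simplified pg_depend row; objrelkind is a stand-in for get_rel_relkind(). */
typedef struct DependRow
{
	Oid			classid;
	Oid			objid;
	Oid			refclassid;
	Oid			refobjid;
	int			refobjsubid;
	char		deptype;
	char		objrelkind;
} DependRow;

/* Hypothetical data: index 16400 on table 16384, auxiliary index 16401. */
const DependRow sample_rows[] = {
	{RelationRelationId, 16402, RelationRelationId, 16400, 0, DEPENDENCY_NORMAL, RELKIND_INDEX},
	{RelationRelationId, 16401, RelationRelationId, 16400, 0, DEPENDENCY_AUTO, RELKIND_INDEX},
	{RelationRelationId, 16400, RelationRelationId, 16384, 1, DEPENDENCY_AUTO, RELKIND_INDEX},
};

/* Return the first index with an AUTO dependency on indexId, else InvalidOid. */
Oid
model_get_auxiliary_index(const DependRow *rows, size_t nrows, Oid indexId)
{
	for (size_t i = 0; i < nrows; i++)
	{
		const DependRow *r = &rows[i];

		/* Equivalent of the three scan keys on the referenced object. */
		if (r->refclassid != RelationRelationId ||
			r->refobjid != indexId ||
			r->refobjsubid != 0)
			continue;

		/* Equivalent of the filter applied inside the scan loop. */
		if (r->classid == RelationRelationId &&
			r->deptype == DEPENDENCY_AUTO &&
			r->objrelkind == RELKIND_INDEX)
			return r->objid;
	}
	return InvalidOid;
}
```

With the sample data, looking up 16400 yields 16401, while looking up 16384 or 16401 yields InvalidOid. The model also shows the assumption the patch relies on: any other index that holds an AUTO dependency on the main index would be mistaken for its auxiliary index.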

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e7a7a160742..298a093f554 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is simply to run
+    <literal>DROP INDEX</literal> on that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 57c347f2930..634ba55d184 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is simply <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6c09c6a2b67..bf0bb79474b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -688,6 +688,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -734,6 +736,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -776,6 +779,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be passed together or not at all */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1177,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1459,6 +1473,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1609,6 +1624,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3842,6 +3858,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3898,6 +3915,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for a junk auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4186,7 +4216,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4275,13 +4306,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4307,18 +4355,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index from an object
+		 * with relkind RELKIND_INDEX is the auxiliary index we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 388c3f92dae..05938ff95e4 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1260,7 +1260,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3639,6 +3639,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3988,6 +3989,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3995,6 +3997,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4068,12 +4071,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4083,6 +4091,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4104,10 +4113,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4288,7 +4305,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * junk auxiliary indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4311,6 +4329,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk auxiliary index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4529,6 +4550,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4580,6 +4603,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2705cf11330..91c04e5bf10 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must be the first operation in its transaction.
+		 * If the index has a junk auxiliary index, we want to drop that too, also
+		 * concurrently. In that case, perform a silent internal deletion of the
+		 * auxiliary index and restore the transaction state. It is fine to do this
+		 * in the loop because there is only a single element in drop->objects.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4713f18e68d..53b2b13efc3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ca74844b5c6..aca6ec57ad7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 2cff1ac29be..e1464eaa67c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0

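The `reindex_relation()` change in the patch above replaces the per-index WARNING with a two-pass loop: the first pass collects the OIDs of auxiliary (STIR) indexes, and the second pass skips them, since they are dropped as part of rebuilding the main index. A minimal standalone sketch of that control flow follows. It is not the patch code: `IndexEntry`, `reindex_skipping_aux`, and the two `SKETCH_*_AM` constants are hypothetical stand-ins, and plain arrays replace the PostgreSQL `List` API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Placeholder access-method ids, valid for this sketch only. */
#define SKETCH_BTREE_AM 1u
#define SKETCH_STIR_AM  2u

typedef struct IndexEntry
{
	Oid			oid;			/* index OID */
	Oid			am;				/* access method of the index */
} IndexEntry;

/* Linear membership test, in the spirit of list_member_oid(). */
static bool
oid_list_member(const Oid *list, size_t n, Oid oid)
{
	for (size_t i = 0; i < n; i++)
		if (list[i] == oid)
			return true;
	return false;
}

/*
 * Pass 1 collects auxiliary (STIR) indexes; pass 2 "rebuilds" all others.
 * "aux" and "rebuilt" must each have room for n entries.
 * Returns the number of indexes rebuilt.
 */
static size_t
reindex_skipping_aux(const IndexEntry *indexes, size_t n,
					 Oid *aux, size_t *naux, Oid *rebuilt)
{
	size_t		nrebuilt = 0;

	assert(indexes != NULL && aux != NULL && naux != NULL && rebuilt != NULL);

	*naux = 0;
	for (size_t i = 0; i < n; i++)
		if (indexes[i].am == SKETCH_STIR_AM)
			aux[(*naux)++] = indexes[i].oid;

	for (size_t i = 0; i < n; i++)
	{
		/* Auxiliary indexes are not rebuilt on their own. */
		if (oid_list_member(aux, *naux, indexes[i].oid))
			continue;
		rebuilt[nrebuilt++] = indexes[i].oid;	/* stand-in for reindex_index() */
	}
	return nrebuilt;
}
```

Collecting the auxiliary OIDs up front keeps the main loop free of access-method lookups, at the cost of a linear membership test per index; with the handful of indexes a table usually has, that cost is likely negligible.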


  [application/x-patch] v18-0010-Optimize-auxiliary-index-handling.patch (2.4K, 12-v18-0010-Optimize-auxiliary-index-handling.patch)
From fd8bd099f109d85b9d6ebc9bf24bff2525073c6d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v18 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indices by:
- in the index‐insert path, detect auxiliary indexes and bypass Datum value computation
- set indexUnchanged=false for auxiliary indices to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bf0bb79474b..d1b96703bbc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2932,6 +2932,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* An auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 6b2e462b70b..0d5fa4dd79f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0


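The two shortcuts in the v18-0010 patch above can be illustrated outside the backend. The sketch below assumes a simplified model: `IndexSketch`, `form_index_datum`, `index_unchanged_hint`, and `expensive_unchanged_check` are hypothetical names, and `long` values stand in for `Datum`. It shows that an auxiliary index gets all-NULL values without evaluating anything, and that placing the auxiliary test before the costly check in the `&&` chain means the check is never called for such an index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_KEYS 4

typedef struct IndexSketch
{
	bool		auxiliary;		/* stand-in for indexInfo->ii_Auxiliary */
	int			nattrs;			/* number of index attributes */
} IndexSketch;

/* Counts how often the costly check actually runs. */
static int	expensive_calls = 0;

/* Stand-in for index_unchanged_by_update(); always answers "unchanged". */
static bool
expensive_unchanged_check(const IndexSketch *idx)
{
	(void) idx;
	expensive_calls++;
	return true;
}

/* Fill values/isnull for one row; auxiliary indexes get all-NULL keys. */
static void
form_index_datum(const IndexSketch *idx, const long *row,
				 long *values, bool *isnull)
{
	assert(idx->nattrs <= MAX_KEYS);

	if (idx->auxiliary)
	{
		for (int i = 0; i < idx->nattrs; i++)
		{
			values[i] = 0;
			isnull[i] = true;
		}
		return;
	}

	for (int i = 0; i < idx->nattrs; i++)
	{
		values[i] = row[i];
		isnull[i] = false;
	}
}

/* && short-circuits, so the costly check is skipped for auxiliary indexes. */
static bool
index_unchanged_hint(const IndexSketch *idx, bool update)
{
	return update && !idx->auxiliary && expensive_unchanged_check(idx);
}
```

The order of operands in the `&&` chain is what delivers the saving: the cheap flag tests come first, so the costly function is reached only for regular indexes during an UPDATE.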

  [application/x-patch] v18-0011-Refresh-snapshot-periodically-during-index-valid.patch (32.1K, 13-v18-0011-Refresh-snapshot-periodically-during-index-valid.patch)
From 1249890d4be801dd1b9146847dfa8bd88f6e2f68 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v18 11/12] Refresh snapshot periodically during index
 validation

Enhance the validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

Validation now takes a fresh snapshot every 4096 heap pages (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 doc/src/sgml/ref/create_index.sgml            | 11 ++-
 doc/src/sgml/ref/reindex.sgml                 | 11 ++-
 src/backend/access/heap/README.HOT            |  4 +-
 src/backend/access/heap/heapam_handler.c      | 77 ++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c           |  2 +-
 src/backend/access/spgist/spgvacuum.c         | 12 ++-
 src/backend/catalog/index.c                   | 42 +++++++---
 src/backend/commands/indexcmds.c              | 50 ++----------
 src/include/access/tableam.h                  |  7 +-
 src/include/access/transam.h                  | 15 ++++
 src/include/catalog/index.h                   |  2 +-
 .../expected/cic_reset_snapshots.out          | 28 +++++++
 .../sql/cic_reset_snapshots.sql               |  1 +
 13 files changed, 179 insertions(+), 83 deletions(-)

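The refresh cadence described in the commit message can be sketched in isolation. The code below is a simplified model and not the backend implementation: `Xid`, `xid_newer`, `ScanResult`, and `scan_with_refresh` are hypothetical names, snapshots are reduced to a caller-supplied array of xmin values, and the wraparound-aware comparison ignores the special (non-normal) transaction ids that PostgreSQL treats separately. It mirrors two details visible in the diff: the page counter starts at 1, and the returned limit is the newest xmin seen across all snapshots.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;			/* stand-in for TransactionId */

/* True if a is logically later than b, modulo 2^32 (special XIDs ignored). */
static int
xid_follows(Xid a, Xid b)
{
	return (int32_t) (a - b) > 0;
}

/* Return whichever of the two XIDs is logically later. */
static Xid
xid_newer(Xid a, Xid b)
{
	return xid_follows(a, b) ? a : b;
}

typedef struct ScanResult
{
	Xid			limit_xmin;		/* newest xmin seen across all snapshots */
	int			refreshes;		/* how many times the snapshot was replaced */
} ScanResult;

/*
 * Walk npages pages, replacing the snapshot whenever the page counter is a
 * multiple of refresh_every.  snapshot_xmins[i] is the xmin of the i-th
 * snapshot taken; the last entry is reused if the array runs out.
 */
static ScanResult
scan_with_refresh(int npages, int refresh_every,
				  const Xid *snapshot_xmins, int nsnapshots)
{
	ScanResult	res;
	int			next = 0;
	int			page_read_counter = 1;	/* the patch also starts at 1 */

	assert(refresh_every > 0 && nsnapshots > 0);

	res.limit_xmin = snapshot_xmins[next++];
	res.refreshes = 0;

	for (int page = 0; page < npages; page++)
	{
		page_read_counter++;
		/* ... tuples on this page would be checked here ... */

		if (page_read_counter % refresh_every == 0)
		{
			Xid			xmin;

			xmin = snapshot_xmins[next < nsnapshots ? next : nsnapshots - 1];
			next++;
			/* xmin should only advance, but never let the limit move back */
			res.limit_xmin = xid_newer(res.limit_xmin, xmin);
			res.refreshes++;
		}
	}
	return res;
}
```

With the counter starting at 1, the first refresh in this model happens after `refresh_every - 1` pages and then every `refresh_every` pages after that; a scan shorter than that never replaces its initial snapshot.
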
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 298a093f554..6220a80474f 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 634ba55d184..b887574f106 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -498,10 +498,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 633bc245e28..dd6994ed98f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; that is why limitXmin is tracked and returned.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but guard against it just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,25 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d186ce9ec37..8d755470e8c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 81171f35451..d721fa45a0c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, we are in
+			 * a validation scan, so always add the TID.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -958,6 +959,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -998,6 +1003,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d1b96703bbc..e707b012f41 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3534,8 +3534,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to let xmin advance, we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3548,7 +3549,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3569,13 +3570,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3625,8 +3627,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3662,6 +3668,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3671,6 +3680,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3690,19 +3702,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3725,6 +3742,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 05938ff95e4..6c7905a2534 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -592,7 +592,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1794,32 +1793,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1841,8 +1819,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last validation snapshot was taken, we have to wait
+	 * out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4348,7 +4326,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4363,13 +4340,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4381,16 +4351,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4403,7 +4365,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index abcb147d9b3..3560ee418b1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,10 +708,9 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void 		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 												Relation index_rel,
 												struct IndexInfo *index_info,
-												Snapshot snapshot,
 												struct ValidateIndexState *state,
 												struct ValidateIndexState *aux_state);
 
@@ -1825,18 +1824,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
 						  struct ValidateIndexState *state,
 						  struct ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 53b2b13efc3..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -154,7 +154,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
-- 
2.43.0
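
For readers following the limitXmin change above: a minimal standalone sketch of how the value returned by the validation scan could be folded from the xmins of successive snapshots. This is not PostgreSQL code. The names xid_follows, xid_newer and fold_limit_xmin are hypothetical, and xid_follows is a simplified stand-in for TransactionIdFollows() that assumes the usual modulo-2^32 comparison for normal XIDs.

```c
#include <assert.h>
#include <stdint.h>

/* Standalone stand-ins for the PostgreSQL definitions (assumed values). */
typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)
#define TransactionIdIsValid(xid)	((xid) != InvalidTransactionId)
#define TransactionIdIsNormal(xid)	((xid) >= FirstNormalTransactionId)

/* Hypothetical helper: is id1 logically later than id2? */
static int
xid_follows(TransactionId id1, TransactionId id2)
{
	/* Special (non-normal) XIDs are compared as plain unsigned values. */
	if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
		return id1 > id2;

	/* Normal XIDs wrap around, so compare the signed difference. */
	return (int32_t) (id1 - id2) > 0;
}

/* Same shape as the TransactionIdNewer() added by the patch. */
static TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	if (!TransactionIdIsValid(a))
		return b;
	if (!TransactionIdIsValid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}

/*
 * Hypothetical helper: fold the xmin of every snapshot taken during a
 * scan into a single limit, starting from "no limit yet".
 */
static TransactionId
fold_limit_xmin(const TransactionId *xmins, int n)
{
	TransactionId limit = InvalidTransactionId;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		limit = xid_newer(limit, xmins[i]);
	return limit;
}
```

With the xmins 4294967290, 4294967295 and 5, fold_limit_xmin() returns 5: after wraparound, 5 is logically newer than 4294967295, which a plain unsigned maximum would get wrong. Starting from InvalidTransactionId also means the first snapshot always sets the limit.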



  [application/x-patch] v18-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.2K, 14-v18-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 44c57dbeb7c398ed4b7b3ff899d8c025888c50f7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v18 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization allowed concurrent index builds to ignore other processes that were concurrently building indexes with neither expressions nor predicates. With the new snapshot handling approach, which periodically refreshes snapshots, this optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- removing special case handling in GetCurrentVirtualXIDs()
- removing related test cases and injection points
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 423424e51a2..93ad3f3f632 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags are expected to be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 274c38d12bb..3ccd0797727 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2094,11 +2094,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags are expected to be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8d755470e8c..00c86bfcfc6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1910,11 +1910,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags are expected to be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6c7905a2534..b76e60eb4a5 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1671,10 +1660,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1729,9 +1714,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1761,10 +1743,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1790,9 +1768,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1809,9 +1785,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1852,10 +1825,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1876,10 +1845,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3620,7 +3585,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3994,17 +3958,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4070,7 +4023,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4171,11 +4123,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4207,10 +4154,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4219,11 +4162,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4248,10 +4186,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4271,11 +4205,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4297,10 +4226,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4336,10 +4261,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4367,9 +4288,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4391,13 +4309,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4453,12 +4364,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4522,12 +4427,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4795,36 +4694,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5e07466c737 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v18-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_ (36.9K, 15-v18-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_)
  download

  [application/octet-stream] v18-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_ (28.7K, 16-v18-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_)
  download

  [application/octet-stream] v18-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_ (96.9K, 17-v18-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_)
  download

  [application/octet-stream] v18-only-part-3-0005-Optimize-auxiliary-index-handling.patch_ (2.4K, 18-v18-only-part-3-0005-Optimize-auxiliary-index-handling.patch_)
  download

  [application/octet-stream] v18-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_ (17.3K, 19-v18-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_)
  download

  [application/octet-stream] v18-only-part-3-0006-Refresh-snapshot-periodically-during.patch_ (20.7K, 20-v18-only-part-3-0006-Refresh-snapshot-periodically-during.patch_)
  download


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-05-18 15:09  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Mihail Nikalayeu @ 2025-05-18 15:09 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; Andres Freund <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased version + materials from PGConf.dev 2025 Poster Session :)

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v19-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (28.7K, 2-v19-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From d28e4c7bbc2980c6d43015126ca88bdcdcc05238 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v19 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)
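
A minimal, self-contained sketch of the dependency idea described in the commit message above: an auxiliary index records an AUTO-style dependency on its main index, so it can be looked up from the main index and goes away when the main index is dropped. This is a toy in-memory model, not the patch's code. All `Toy*` / `toy_*` names are hypothetical and exist only for illustration; the real patch uses `recordDependencyOn()` with `DEPENDENCY_AUTO` and a `pg_depend` scan in `get_auxiliary_index()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_INVALID_OID 0u
#define TOY_MAX_INDEXES 8

typedef unsigned int ToyOid;

/* One toy catalog row: an index and, for auxiliary indexes, its main index. */
typedef struct ToyIndex
{
    ToyOid oid;            /* TOY_INVALID_OID marks a free slot */
    ToyOid auxiliary_for;  /* main index OID, or TOY_INVALID_OID */
} ToyIndex;

typedef struct ToyCatalog
{
    ToyIndex rows[TOY_MAX_INDEXES];
} ToyCatalog;

/* Register an index; returns false when the toy catalog is full. */
static bool
toy_create_index(ToyCatalog *cat, ToyOid oid, ToyOid auxiliary_for)
{
    assert(oid != TOY_INVALID_OID);
    for (size_t i = 0; i < TOY_MAX_INDEXES; i++)
    {
        if (cat->rows[i].oid == TOY_INVALID_OID)
        {
            cat->rows[i].oid = oid;
            cat->rows[i].auxiliary_for = auxiliary_for;
            return true;
        }
    }
    return false;
}

/* Is an index with this OID still present in the toy catalog? */
static bool
toy_index_exists(const ToyCatalog *cat, ToyOid oid)
{
    if (oid == TOY_INVALID_OID)
        return false;
    for (size_t i = 0; i < TOY_MAX_INDEXES; i++)
    {
        if (cat->rows[i].oid == oid)
            return true;
    }
    return false;
}

/* Return the auxiliary index recorded for main_oid, or TOY_INVALID_OID. */
static ToyOid
toy_get_auxiliary_index(const ToyCatalog *cat, ToyOid main_oid)
{
    if (main_oid == TOY_INVALID_OID)
        return TOY_INVALID_OID;
    for (size_t i = 0; i < TOY_MAX_INDEXES; i++)
    {
        if (cat->rows[i].oid != TOY_INVALID_OID &&
            cat->rows[i].auxiliary_for == main_oid)
            return cat->rows[i].oid;
    }
    return TOY_INVALID_OID;
}

/*
 * Drop an index together with any auxiliary index that depends on it,
 * mimicking an AUTO dependency.  Returns the number of rows removed.
 */
static int
toy_drop_index(ToyCatalog *cat, ToyOid oid)
{
    int removed = 0;

    if (oid == TOY_INVALID_OID)
        return 0;
    for (size_t i = 0; i < TOY_MAX_INDEXES; i++)
    {
        ToyIndex *row = &cat->rows[i];

        if (row->oid == TOY_INVALID_OID)
            continue;
        if (row->oid == oid || row->auxiliary_for == oid)
        {
            row->oid = TOY_INVALID_OID;
            row->auxiliary_for = TOY_INVALID_OID;
            removed++;
        }
    }
    return removed;
}
```

To try it, zero-initialize a `ToyCatalog`, register a main index and an auxiliary index that points at it, then call `toy_drop_index()` on the main index: both rows are removed, while unrelated indexes stay. The real dependency machinery additionally handles locking, concurrent drops and catalog visibility, none of which this sketch models.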

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e7a7a160742..298a093f554 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 57c347f2930..634ba55d184 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6c09c6a2b67..bf0bb79474b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -688,6 +688,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -734,6 +736,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -776,6 +779,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be given together or not at all */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1177,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1459,6 +1473,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1609,6 +1624,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3842,6 +3858,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3898,6 +3915,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; if one exists, it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4186,7 +4216,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4275,13 +4306,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4307,18 +4355,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 65fa7fd74e0..354ce8dd463 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1260,7 +1260,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3639,6 +3639,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3988,6 +3989,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3995,6 +3997,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4068,12 +4071,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4083,6 +4091,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4104,10 +4113,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4288,7 +4305,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * junk auxiliary indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4311,6 +4329,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk auxiliary index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4529,6 +4550,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4580,6 +4603,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 54ad38247aa..a1043c183f0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must run as the first transaction. But if
+		 * there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, silently perform an internal deletion
+		 * of the auxiliary index and restore the transaction state. This is
+		 * fine inside the loop because drop->objects has a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4713f18e68d..53b2b13efc3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ca74844b5c6..aca6ec57ad7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 2cff1ac29be..e1464eaa67c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v19-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.3K, 3-v19-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 854a2d3d5b7389f41c3a9392ad603f074fe77b33 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v19 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization allowed concurrent index builds to ignore other concurrent builds of indexes without expressions or predicates while waiting for older snapshots. With the new snapshot handling approach that periodically refreshes snapshots, this optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- removing special case handling in GetCurrentVirtualXIDs()
- removing related test cases and injection points
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 423424e51a2..93ad3f3f632 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker at this point.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 629f6d5f2c0..df79b5850f9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker at this point.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8d755470e8c..00c86bfcfc6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1910,11 +1910,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on the parallel worker at this point.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f58e138eed2..2f066f45c62 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1671,10 +1660,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1729,9 +1714,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1761,10 +1743,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1790,9 +1768,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1809,9 +1785,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1852,10 +1825,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1876,10 +1845,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3620,7 +3585,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3994,17 +3958,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4070,7 +4023,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4171,11 +4123,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4207,10 +4154,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4219,11 +4162,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4248,10 +4186,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4271,11 +4205,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4297,10 +4226,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4336,10 +4261,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4367,9 +4288,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4391,13 +4309,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4453,12 +4364,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4522,12 +4427,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4795,36 +4694,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5e07466c737 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v19-0010-Optimize-auxiliary-index-handling.patch (2.4K, 4-v19-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From c21b40416b5a0b668aa7dbd1fc994c77685fb18a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v19 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indices by:
- in the index-insert path, detect auxiliary indexes and bypass Datum value computation
- set indexUnchanged=false for auxiliary indices to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bf0bb79474b..d1b96703bbc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2932,6 +2932,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 499cba145dd..c8b51e2725c 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v19-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (96.8K, 5-v19-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 031d66f94c0756133d0da0bed3b946ac588c8b03 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v19 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 292 +++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/catalog/toasting.c             |   3 +-
 src/backend/commands/indexcmds.c           | 337 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  28 +-
 src/include/catalog/index.h                |  12 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/execnodes.h              |   4 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1104 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4265a22d4de..8ccd69b14c2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6314,6 +6314,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6354,13 +6366,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6377,8 +6388,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 147a8f7587c..e7a7a160742 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..57c347f2930 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any missing ones visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0eaa4df5582..633bc245e28 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation, store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation that works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end both tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e9e22ec0e84..6c09c6a2b67 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should equal the persistence level of the table; auxiliary
+ *		indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -744,7 +749,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -755,11 +761,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +798,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1408,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1463,7 +1474,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1473,6 +1485,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into the catalogs and must
+ * be built later on.  This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes are unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2630,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3469,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that does not
+ * scan the table, so the result is an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3493,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3515,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of the
+ * two TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must run in a particular order: auxiliary index first, target index last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3540,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% is used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3592,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3617,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3663,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are ended by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3695,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3756,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4032,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4281,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4307,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 15efb02badb..edd61c294a6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1288,16 +1288,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 15206d27227..65fa7fd74e0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,6 +588,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -833,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1251,7 +1267,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1593,6 +1610,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of concurrent build - create auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1621,11 +1648,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1635,7 +1662,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1674,7 +1701,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1686,14 +1713,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index. It is created empty, without any
+	 * actual heap scan, but marked as "ready-for-inserts". The goal of that
+	 * index is to accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1722,9 +1773,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1742,24 +1812,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1786,7 +1846,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1811,6 +1871,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3531,6 +3638,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3636,8 +3744,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3689,8 +3804,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3751,6 +3873,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3854,15 +3983,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3913,6 +4045,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3926,12 +4063,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3940,6 +4082,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3958,10 +4101,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4042,13 +4189,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because it starts empty and needs no heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4091,6 +4281,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4098,12 +4323,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4141,7 +4360,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4170,7 +4389,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4260,14 +4479,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4292,6 +4511,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4305,11 +4546,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4329,6 +4570,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index acd20dbfab8..6c43f47814d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,12 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	void		(*index_validate_scan) (Relation table_rel,
+										Relation index_rel,
+										struct IndexInfo *index_info,
+										Snapshot snapshot,
+										struct ValidateIndexState *state,
+										struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1820,19 +1821,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sortstates in
+ * both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
 						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	table_rel->rd_tableam->index_validate_scan(table_rel,
+											   index_rel,
+											   index_info,
+											   snapshot,
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..4713f18e68d 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3850dde4adb..76f25ec686f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -187,8 +187,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..ca74844b5c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..ed6c20a495c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,14 +2041,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e21ff426519..2cff1ac29be 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v19-0011-Refresh-snapshot-periodically-during-index-valid.patch (32.1K, 6-v19-0011-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 02dc23e508622b19ef2df3df1de763cd37ddb58b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v19 11/12] Refresh snapshot periodically during index
 validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

Validation now takes a fresh snapshot every 4096 heap pages read (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 doc/src/sgml/ref/create_index.sgml            | 11 ++-
 doc/src/sgml/ref/reindex.sgml                 | 11 ++-
 src/backend/access/heap/README.HOT            |  4 +-
 src/backend/access/heap/heapam_handler.c      | 77 ++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c           |  2 +-
 src/backend/access/spgist/spgvacuum.c         | 12 ++-
 src/backend/catalog/index.c                   | 42 +++++++---
 src/backend/commands/indexcmds.c              | 50 ++----------
 src/include/access/tableam.h                  |  7 +-
 src/include/access/transam.h                  | 15 ++++
 src/include/catalog/index.h                   |  2 +-
 .../expected/cic_reset_snapshots.out          | 28 +++++++
 .../sql/cic_reset_snapshots.sql               |  1 +
 13 files changed, 179 insertions(+), 83 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 298a093f554..6220a80474f 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Because <command>CREATE INDEX CONCURRENTLY</command> uses periodically
+   refreshed snapshots and an auxiliary index, it has minimal impact on
+   concurrent <command>VACUUM</command> operations. The build regularly
+   advances its transaction horizon instead of holding a single long-lived
+   snapshot, allowing <command>VACUUM</command> to remove dead tuples on
+   other tables without having to wait for the entire index build to
+   complete. A snapshot holds back the horizon only until the next
+   refresh.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 634ba55d184..b887574f106 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -498,10 +498,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 633bc245e28..4456b16df70 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * so that the xmin horizon can advance.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just for the case*/
+			/* xmin should not go backwards, but guard against it just in case */
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,25 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid", NULL);
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d186ce9ec37..8d755470e8c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2678f7ab782..968a8f7725c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, this is a
+			 * validation scan, so the TID is always added.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d1b96703bbc..e707b012f41 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3534,8 +3534,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3548,7 +3549,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3569,13 +3570,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3625,8 +3627,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3662,6 +3668,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3671,6 +3680,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3690,19 +3702,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3725,6 +3742,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 354ce8dd463..f58e138eed2 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -592,7 +592,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1794,32 +1793,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1841,8 +1819,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last validation snapshot was taken, we have to wait out any
+	 * transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4348,7 +4326,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4363,13 +4340,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4381,16 +4351,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4403,7 +4365,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6c43f47814d..d38a6961035 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,10 +708,9 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void 		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 												Relation index_rel,
 												struct IndexInfo *index_info,
-												Snapshot snapshot,
 												struct ValidateIndexState *state,
 												struct ValidateIndexState *aux_state);
 
@@ -1825,18 +1824,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
 						  struct ValidateIndexState *state,
 						  struct ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 53b2b13efc3..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -154,7 +154,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
-- 
2.43.0
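
A note on the `TransactionIdNewer` helper added to transam.h in the patch above: it picks the later of two XIDs using modulo-2^32 ordering and treats an invalid XID as "no value". Below is a minimal standalone sketch of that behaviour. The names `xid_is_valid`, `xid_follows` and `xid_newer` are hypothetical stand-ins written for illustration, not PostgreSQL symbols, and the comparison rule is my simplified reading of `TransactionIdFollows`.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

/* Stand-in for TransactionIdIsValid(): XID 0 means "not set". */
static inline bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

/*
 * Stand-in for TransactionIdFollows(): is a logically later than b?
 * Normal XIDs are compared modulo 2^32 through a signed difference;
 * special XIDs (below FirstNormalTransactionId) compare as plain integers.
 */
static inline bool
xid_follows(TransactionId a, TransactionId b)
{
	if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Same shape as the helper in the patch: an invalid input yields the other. */
static inline TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}
```

With these definitions, an XID of 10 issued just after wraparound counts as newer than 4294967290, which a plain `a > b` comparison would get wrong.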



  [application/octet-stream] v19-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.9K, 7-v19-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From bade8833d582234c10aac67fb86cbb3659580718 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v19 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds by adding:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
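
To make the intent easier to follow, here is a rough in-memory model of what STIR does: it appends TIDs to the last page, adds a page when the last one is full, refuses inserts once a skip flag is set, and lets the validation phase visit every stored TID through a callback. This is only a sketch under simplifying assumptions (no buffers, locks or metapage). All names in it (`TidStore`, `tid_store_insert`, `tid_store_scan`, `count_tid`, `TIDS_PER_PAGE`, `MAX_PAGES`) are hypothetical and do not appear in the patch. Unlike `stirinsert`, the model's insert returns true when the TID was stored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TIDS_PER_PAGE	4		/* hypothetical, tiny on purpose */
#define MAX_PAGES		8		/* hypothetical capacity of the model */

typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

typedef struct TidPage
{
	uint16_t	maxoff;			/* number of TIDs stored on this page */
	Tid			tids[TIDS_PER_PAGE];
} TidPage;

typedef struct TidStore
{
	bool		skip_inserts;	/* models the metapage skipInserts flag */
	uint32_t	npages;			/* data pages in use; 0 means none yet */
	TidPage		pages[MAX_PAGES];
} TidStore;

static void
tid_store_init(TidStore *s, bool skip_inserts)
{
	s->skip_inserts = skip_inserts;
	s->npages = 0;
}

/* Append a TID; returns true if stored, false if skipped or out of room. */
static bool
tid_store_insert(TidStore *s, Tid tid)
{
	TidPage    *page;

	if (s->skip_inserts)
		return false;

	/* Try the last page first. */
	if (s->npages > 0)
	{
		page = &s->pages[s->npages - 1];
		if (page->maxoff < TIDS_PER_PAGE)
		{
			page->tids[page->maxoff++] = tid;
			return true;
		}
	}

	/* The last page is full (or there is none): extend by one page. */
	if (s->npages == MAX_PAGES)
		return false;
	page = &s->pages[s->npages++];
	page->maxoff = 0;
	page->tids[page->maxoff++] = tid;
	return true;
}

typedef bool (*TidCallback) (Tid tid, void *state);

/* Visit every stored TID in insertion order; returns how many were visited. */
static size_t
tid_store_scan(const TidStore *s, TidCallback callback, void *state)
{
	size_t		visited = 0;

	for (uint32_t p = 0; p < s->npages; p++)
		for (uint16_t i = 0; i < s->pages[p].maxoff; i++)
		{
			(void) callback(s->pages[p].tids[i], state);
			visited++;
		}
	return visited;
}

/* Example callback: counts TIDs and never asks for a deletion. */
static bool
count_tid(Tid tid, void *state)
{
	(void) tid;
	(*(size_t *) state)++;
	return false;
}
```

Nothing is ever removed from the store, which matches the "we never delete in stir" check in `stirbulkdelete`: the callback is used only to collect TIDs.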
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index a6dad54ff58..ca5214461e6 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad09..232c87ec267 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3092,6 +3092,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3143,6 +3144,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete is called during the index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it is an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can only be built as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cca1dbb8e37..e9e22ec0e84 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..38602e6a72d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index dfbb4c85460..a121b4d31c9 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62beb71da28..f05a5eecdda 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5b6cadb5a6c..3850dde4adb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -182,12 +182,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,6 +217,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index cf48ae6d0c2..52dde57680d 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5137,7 +5137,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5151,7 +5152,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5176,9 +5178,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5187,12 +5189,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5201,7 +5204,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v19-0007-Add-Datum-storage-support-to-tuplestore.patch (17.3K, 8-v19-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From 9fa8406cb66f0dcff6e16e0fe64fd7b6d099f6bf Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v19 07/12] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
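The storage rules listed in the commit message can be illustrated with a small standalone sketch. This is not code from the patch: the `Tape` struct and the `tape_write`, `tape_read`, `put_value` and `get_value` helpers are hypothetical names, and an in-memory buffer stands in for the BufFile that tuplestore.c actually uses. It only shows the assumed layout: fixed-length values as raw bytes, variable-length values with a leading "unsigned int" length word and a trailing one when backward scans are enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape"; the patch itself writes to a BufFile. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

static void
tape_read(Tape *t, void *dst, size_t n)
{
	assert(t->rpos + n <= t->wpos);
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
}

/*
 * Store one value.  typlen > 0 means fixed length: raw bytes only.
 * typlen < 0 means variable length: leading length word, data, and a
 * trailing length word when backward scans are requested.
 */
static void
put_value(Tape *t, const void *data, unsigned int datalen,
		  int typlen, int backward)
{
	if (typlen > 0)
	{
		tape_write(t, data, (size_t) typlen);
		return;
	}
	tape_write(t, &datalen, sizeof(datalen));
	tape_write(t, data, datalen);
	if (backward)
		tape_write(t, &datalen, sizeof(datalen));
}

/* Read one value into dst; returns its length, or 0 at end of tape. */
static unsigned int
get_value(Tape *t, void *dst, int typlen, int backward)
{
	unsigned int len;

	if (t->rpos == t->wpos)
		return 0;
	if (typlen > 0)
		len = (unsigned int) typlen;	/* length is known, nothing to read */
	else
		tape_read(t, &len, sizeof(len));
	tape_read(t, dst, len);
	if (typlen < 0 && backward)
	{
		unsigned int trailer;

		tape_read(t, &trailer, sizeof(trailer));
		assert(trailer == len);
	}
	return len;
}
```

In this sketch a 6-byte value (the size of a TID) occupies exactly 6 bytes on the tape, which is the case the next commit relies on. The caller is assumed to supply a destination buffer large enough for the value being read.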
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..12ae705c091 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type passed by value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get length of tuple from tape. Used to provide 'len' argument
+	 * for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a constant-length Datum, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-length Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v19-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.4K, 9-v19-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From 52ab4a6c4d7d944c4ca26b800d504c7bf507ef9f Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v19 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now reset snapshots periodically during concurrent unique index builds, while still maintaining uniqueness by:
- ignoring SnapshotSelf-dead tuples during uniqueness checks in tuplesort (a fail-fast mechanism, not a guarantee)
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values (the actual guarantee of correctness)

Tuples are tested against SnapshotSelf only when their index key values are equal; otherwise _bt_load works as before.
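
As a rough illustration of the second check, here is a minimal standalone sketch, not the patch's code. DemoTuple and demo_has_alive_duplicate are hypothetical names, and the alive flag stands in for the result of a SnapshotSelf heap fetch. Unlike _bt_load in the patch, which tests liveness lazily and only inside a run of equal keys, the sketch assumes the flag is already known for every tuple.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sorted index tuple: a key plus liveness. */
typedef struct
{
	int		key;
	bool	alive;	/* assumed result of a SnapshotSelf visibility test */
} DemoTuple;

/*
 * Return true if any run of equal keys contains more than one alive tuple.
 * The input must already be sorted by key, as tuplesort output would be.
 */
static bool
demo_has_alive_duplicate(const DemoTuple *tuples, size_t n)
{
	bool	seen_alive = false;

	for (size_t i = 0; i < n; i++)
	{
		/* A new key starts a new group, so forget the previous group. */
		if (i > 0 && tuples[i].key != tuples[i - 1].key)
			seen_alive = false;

		if (tuples[i].alive)
		{
			if (seen_alive)
				return true;	/* second alive tuple with the same key */
			seen_alive = true;
		}
	}
	return false;
}
```

The "addda" case mentioned in the _bt_load comment (alive, dead, dead, dead, alive under one key) is reported as a duplicate by this sketch even though the two alive tuples are not adjacent in sort order.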
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 264 insertions(+), 94 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7273b1aee00..0eaa4df5582 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 08884116aec..347b50d6e51 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2f45ae96c0c..d186ce9ec37 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index that is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of a concurrent build.
+	 * This is required because of the periodic snapshot resets.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool is free of
+	 * multiple alive tuples with the same values. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* test the previous tuple if not yet tested */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1320,7 +1432,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1417,7 +1529,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,21 +1546,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1457,16 +1559,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1536,6 +1638,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1550,7 +1653,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1630,7 +1733,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1641,7 +1744,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1744,6 +1847,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1847,11 +1951,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1931,6 +2036,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1953,14 +2059,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7d3..d07fe72713d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -66,8 +66,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2532,7 +2530,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3852,7 +3850,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3970,17 +3968,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as comparison function during concurrent
+ * unique index build in case _bt_keep_natts_fast is not suitable because
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when any examined
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4006,6 +4011,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4025,7 +4032,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4036,7 +4043,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. Also, it may be used as comparison function during concurrent
+ * build of unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4045,6 +4053,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when any examined
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4053,7 +4063,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4070,6 +4081,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6432ef55cdc..cca1dbb8e37 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a93d4f388bc..15206d27227 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,8 +1694,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 5f70e8dddac..71a5c21e0df 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -358,6 +360,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -400,6 +403,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1653,6 +1657,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1662,18 +1667,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..38471e90a0c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1339,8 +1339,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a69f71a3ace..acd20dbfab8 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan, so
+ * snapshots are replaced on the fly and the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/octet-stream] v19-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 10-v19-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From 8eaa54a4919625a8fa69854fef670d4b3258bff8 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v19 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying that technique to parallel builds was the requirement to wait until worker processes restore their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with scan
- relax limitation for parallel worker to call GetLatestSnapshot
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)
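
Reviewer note (not part of the patch): the ordering constraint described in the commit message can be pictured with a small standalone model. This is only a sketch under stated assumptions and none of it is PostgreSQL code: `Participant`, `oldest_xmin`, `worker_restore` and `leader_try_reset` are hypothetical names, slot 0 is assumed to be the leader, and transaction ID wraparound is ignored. It shows the leader refusing to replace its snapshot until every worker has imported the initial one, and the horizon advancing only after every participant has moved on.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_XID ((uint32_t) 0)

/* Hypothetical per-participant state; slot 0 is assumed to be the leader. */
typedef struct Participant
{
	uint32_t	xmin;		/* xmin pinned by the current snapshot, or INVALID_XID */
	bool		restored;	/* worker has imported the leader's initial snapshot */
} Participant;

/* Oldest xmin pinned by any participant, or INVALID_XID if none is pinned. */
static uint32_t
oldest_xmin(const Participant *p, size_t n)
{
	uint32_t	result = INVALID_XID;

	for (size_t i = 0; i < n; i++)
	{
		if (p[i].xmin == INVALID_XID)
			continue;
		/* Plain comparison: wraparound is deliberately ignored in this model. */
		if (result == INVALID_XID || p[i].xmin < result)
			result = p[i].xmin;
	}
	return result;
}

/* Worker side: import the leader's snapshot, pinning the same xmin. */
static void
worker_restore(Participant *worker, const Participant *leader)
{
	worker->xmin = leader->xmin;
	worker->restored = true;
}

/* True once every worker (slots 1..n-1) has imported the initial snapshot. */
static bool
all_workers_restored(const Participant *p, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		if (!p[i].restored)
			return false;
	}
	return true;
}

/*
 * Leader side: replace the leader's snapshot only when it is safe to do so.
 * Returns true if the reset was performed.
 */
static bool
leader_try_reset(Participant *p, size_t n, uint32_t new_xmin)
{
	if (!all_workers_restored(p, n))
		return false;		/* a worker may still need the initial snapshot */
	p[0].xmin = new_xmin;
	return true;
}
```

With one leader and two workers, `leader_try_reset` returns false until both workers have called `worker_restore`. After that the leader may move to a newer xmin, but `oldest_xmin` stays at the initial value until the workers reset their own snapshots as well.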

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e5a945a1b14..423424e51a2 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 4cea1612ce6..629f6d5f2c0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8a584db595a..7273b1aee00 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f3986d086b6..2f45ae96c0c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1420,6 +1417,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1437,12 +1435,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1450,6 +1457,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1510,7 +1522,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1537,7 +1549,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1613,6 +1626,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so wait until
+	 * all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1621,7 +1641,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1645,7 +1666,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1895,6 +1916,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1949,11 +1971,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1989,4 +2015,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..6f04c365994 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that
+	 * use periodic snapshot resets, mark the scan accordingly and use the active
+	 * snapshot as the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Establish dynamic shared memory to pass information about importing
+		 * of snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cbd0ba9aa01..6432ef55cdc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ed35c58c2c3..8a15dd72b91 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ad440ff024c..f251bc52895 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -342,14 +342,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8df6ba9b89e..a69f71a3ace 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. That causes snapshots to be changed on the fly, allowing the
+ * xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v19-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.1K, 11-v19-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From 1d5a4fbd43c023b3010c61453b5846801792e0fc Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v19 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to allow VACUUM to ignore such snapshots to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces an alternative by periodically resetting the snapshot used during the first phase. By resetting the snapshot every N pages during the heap scan, it allows the xmin horizon to advance.

Currently, this technique is applied only to:

- the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in following commits

A new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset "between" every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
---
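Note (illustration only, not part of the patch): the reset rule described in
the commit message can be sketched as a small standalone model. Everything
named toy_* or TOY_* below is hypothetical, and the interval of 4 is an
assumed value chosen for the example, not the actual
SO_RESET_SNAPSHOT_EACH_N_PAGE. The model only mirrors the block-number check
that the patch adds to heap_fetch_next_buffer(); it does not touch real
snapshot management.

```c
#include <stdbool.h>

/* Assumed interval for illustration; the patch defines its own constant. */
#define TOY_RESET_EACH_N_PAGE 4

/* Hypothetical stand-in for a heap scan descriptor. */
typedef struct ToyScan
{
	bool		reset_snapshot; /* models the SO_RESET_SNAPSHOT flag */
	int			snapshot_id;	/* models "which snapshot is active" */
	int			resets;			/* number of resets performed so far */
} ToyScan;

/* Models heap_reset_scan_snapshot(): drop the old snapshot, take a new one. */
static void
toy_reset_snapshot(ToyScan *scan)
{
	scan->snapshot_id++;
	scan->resets++;
}

/* Models the check performed each time the next block is fetched. */
static void
toy_fetch_next_block(ToyScan *scan, int blockno)
{
	if (scan->reset_snapshot && (blockno % TOY_RESET_EACH_N_PAGE == 0))
		toy_reset_snapshot(scan);
}

/* Scan blocks 0 .. nblocks-1 and return how many resets happened. */
static int
toy_scan(ToyScan *scan, int nblocks)
{
	for (int blockno = 0; blockno < nblocks; blockno++)
		toy_fetch_next_block(scan, blockno);
	return scan->resets;
}
```

In this model a 10-block scan with the flag set resets at blocks 0, 4 and 8,
while a scan without the flag never resets. Between resets no snapshot is
held, which is the window in which the xmin horizon can advance.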
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3048e044aec..e59197bb35e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 0d9c2b0b653..a6dad54ff58 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f856..e5a945a1b14 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a65acd89104..4cea1612ce6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ec8cda1c68..10316246e4d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -612,6 +613,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases that is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -653,7 +684,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1304,6 +1340,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..8a584db595a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs one consistent snapshot for the whole scan.
+	 * A parallel scan would need additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, and that infrastructure is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is replaced every so often and a
+			 * registered snapshot would hold back the xmin horizon.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot; the push may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);	/* do not reset snapshots */
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924ad..f3986d086b6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1409,6 +1419,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1434,9 +1445,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1490,6 +1508,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1584,6 +1604,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1600,6 +1622,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
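Reviewer's sketch (not part of the patch): the need_pop_active_snapshot
bookkeeping in _bt_begin_parallel() above has to undo the push on each of
its three exit paths. The toy model below shows that discipline with a
trivial stack. All sketch_ names are hypothetical, the integer 42 merely
stands in for a snapshot, and nothing here calls the real snapmgr API.

```c
#include <stdbool.h>

#define SKETCH_MAX_DEPTH 8

/* Toy stand-in for the active-snapshot stack. */
typedef struct SketchSnapshotStack
{
	int			depth;
	int			items[SKETCH_MAX_DEPTH];
} SketchSnapshotStack;

static void
sketch_push(SketchSnapshotStack *stack, int snap)
{
	stack->items[stack->depth++] = snap;
}

static void
sketch_pop(SketchSnapshotStack *stack)
{
	stack->depth--;
}

/*
 * Same control flow as the patched _bt_begin_parallel(): push only for a
 * concurrent build, then pop on the no-DSM path, the no-workers path, and
 * the normal path.  Returns -1, 0, or the number of launched workers.
 */
static int
sketch_begin_parallel(SketchSnapshotStack *stack, bool isconcurrent,
					  bool have_dsm, int nworkers_launched)
{
	bool		need_pop = true;

	if (!isconcurrent)
		need_pop = false;		/* caller's snapshot stays in place */
	else
		sketch_push(stack, 42);

	if (!have_dsm)
	{
		if (need_pop)
			sketch_pop(stack);
		return -1;
	}

	if (nworkers_launched == 0)
	{
		if (need_pop)
			sketch_pop(stack);
		return 0;
	}

	if (need_pop)
		sketch_pop(stack);
	return nworkers_launched;
}
```

Whatever path is taken, the stack depth on return equals the depth on entry,
which is the property the added PopActiveSnapshot() calls are meant to
preserve.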
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..cbd0ba9aa01 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);	/* do not reset snapshots */
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the scan,
+ * which makes a fresh snapshot active every so often.  The reason for that
+ * is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0f75debe7f1..a93d4f388bc 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,23 +1694,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index from all tuples visible to a single snapshot or to
+	 * periodically refreshed ones.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4073,9 +4067,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4090,7 +4081,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 49ad6e83578..ded9eecfbc0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6901,6 +6902,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6956,6 +6958,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7013,6 +7020,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..6caad42ea4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..8df6ba9b89e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset the scan and catalog snapshots every so often?  If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot becomes active.
+	 *
+	 * The snapshot is not popped at the end of the scan.
+	 * The goal of this mode is to keep the xmin horizon advancing.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-parallel concurrent build of a non-unique index,
+ * SO_RESET_SNAPSHOT is applied to the scan.  Snapshots are then replaced
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v19-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 12-v19-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 52d582222e047be124ff5e9a653178eec085f0f7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v19 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v19-only-part-3-0005-Optimize-auxiliary-index-handling.patch_ (2.4K, 13-v19-only-part-3-0005-Optimize-auxiliary-index-handling.patch_)
  download

  [application/octet-stream] v19-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (25.3K, 14-v19-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From 9993b0c3dc8df7b3a026e7c8f6a43b5ab592a833 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v19 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for the stability of the stress tests.

---
 contrib/amcheck/meson.build                   |   1 +
 .../t/006_cic_bt_index_parent_check.pl        |  39 +++++
 contrib/amcheck/verify_nbtree.c               |  68 ++++-----
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 +++++++++++++--
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++++++++-----
 src/backend/utils/time/snapmgr.c              |   2 +
 9 files changed, 285 insertions(+), 88 deletions(-)
 create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index b33e8c9b062..b040000dd55 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -49,6 +49,7 @@ tests += {
       't/003_cic_2pc.pl',
       't/004_verify_nbtree_unique.pl',
       't/005_pitr.pl',
+      't/006_cic_bt_index_parent_check.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..6e52c5e39ec
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,39 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d962fe392cd..0f75debe7f1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1790,6 +1790,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4195,7 +4196,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4274,6 +4275,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..499cba145dd 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 3f8a4cb5244..f1757d02f1c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -487,6 +487,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Not supported for exclusion constraints. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -697,6 +739,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -707,23 +751,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of unequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 46d533b7288..566dbecb390 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1178,6 +1179,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 59233b64730..0c720e450e9 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -716,12 +716,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -756,8 +758,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -769,30 +771,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -815,7 +863,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -835,27 +889,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -875,7 +925,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -883,6 +933,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -920,27 +974,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the
+		 * same set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -948,7 +1010,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* It is required to have at least one indisvalid index during the planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..ad440ff024c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -447,6 +448,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0



  [application/octet-stream] v19-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_ (28.7K, 15-v19-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_)
  download

  [application/octet-stream] v19-only-part-3-0006-Refresh-snapshot-periodically-during.patch_ (20.7K, 16-v19-only-part-3-0006-Refresh-snapshot-periodically-during.patch_)
  download

  [application/octet-stream] v19-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_ (36.9K, 17-v19-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_)
  download

  [application/octet-stream] v19-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_ (17.3K, 18-v19-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_)
  download

  [application/octet-stream] v19-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_ (96.9K, 19-v19-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_)
  download

  [application/pdf] STIR-poster.pdf (1.5M, 20-STIR-poster.pdf)
  download

^ permalink  raw  reply  [nested|flat] 33+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-05-18 15:56  Álvaro Herrera <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Álvaro Herrera @ 2025-05-18 15:56 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello Mihail,

On 2025-May-18, Mihail Nikalayeu wrote:

> Hello, everyone!
> 
> Rebased version + materials from PGConf.dev 2025 Poster Session :)

I agree with Matthias that this work is important, so thank you for
persisting on it.

I didn't understand why you have a few "v19" patches and also a separate
series of "v19-only-part-3-" patches.  Is there duplication?  How do
people know which series comes first?

I think it would be better to get the PDF poster in a wiki page ... in
fact I would suggest to Andrey that he could start a wiki page with all
the PDFs presented at the conference.  Distributing a bunch of 2 MB pdf
via the mailing list doesn't sound too great an idea to me.  A few
people are having trouble with email quotas in cloud services, and the
list server gets bothered because of it.  Kindly don't do that anymore.

Regards

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/





^ permalink  raw  reply  [nested|flat] 33+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-05-18 16:09  Mihail Nikalayeu <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  0 siblings, 1 reply; 33+ messages in thread

From: Mihail Nikalayeu @ 2025-05-18 16:09 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Álvaro!

> I didn't understand why you have a few "v19" patches and also a separate
> series of "v19-only-part-3-" patches.  Is there duplication?  How do
> people know which series comes first?

This was explained in the previous email [0]:

> Patch itself contains 4 parts, some of them may be reviewed/committed
> separately. All commit messages are detailed and contain additional
> explanation of changes.

> To not confuse CFBot, commits are presented in the following way: part
> 1, 2, 3 and 4. If you want only part 3 to test/review – check the
> files with "patch_" extensions. They differ a little bit, but changes
> are minor.

If you have an idea of a better way to handle it, please share. Yes,
the current approach is a bit odd.

> I think it would be better to get the PDF poster in a wiki page ... in
> fact I would suggest to Andrey that he could start a wiki page with all
> the PDFs presented at the conference.  Distributing a bunch of 2 MB pdf
> via the mailing list doesn't sound too great an idea to me.  A few
> people are having trouble with email quotas in cloud services, and the
> list server gets bothered because of it.  Kindly don't do that anymore.

Oh, you're right—I just didn't think of that. My bad, sorry about that.

Best regards,
Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...





^ permalink  raw  reply  [nested|flat] 33+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-05-23 21:59  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 0 replies; 33+ messages in thread

From: Mihail Nikalayeu @ 2025-05-23 21:59 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased, patch structure and comments available here [0]. Quick
introduction poster - here [1].

Best regards,
Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf


Attachments:

  [application/octet-stream] v20-only-part-3-0002-Use-auxiliary-indexes-for-concurrent.patch_ (96.9K, 2-v20-only-part-3-0002-Use-auxiliary-indexes-for-concurrent.patch_)
  download

  [application/octet-stream] v20-only-part-3-0005-Refresh-snapshot-periodically-during.patch_ (20.7K, 3-v20-only-part-3-0005-Refresh-snapshot-periodically-during.patch_)
  download

  [application/octet-stream] v20-only-part-3-0004-Optimize-auxiliary-index-handling.patch_ (2.4K, 4-v20-only-part-3-0004-Optimize-auxiliary-index-handling.patch_)
  download

  [application/octet-stream] v20-only-part-3-0003-Track-and-drop-auxiliary-indexes-in-.patch_ (28.3K, 5-v20-only-part-3-0003-Track-and-drop-auxiliary-indexes-in-.patch_)
  download

  [application/octet-stream] v20-only-part-3-0001-Add-Datum-storage-support-to-tuplest.patch_ (17.3K, 6-v20-only-part-3-0001-Add-Datum-storage-support-to-tuplest.patch_)
  download

  [application/octet-stream] v20-0011-Refresh-snapshot-periodically-during-index-valid.patch (32.1K, 7-v20-0011-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From bc8f8fb41c54b03f7298396f24bd2e007e327aa9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v20 11/12] Refresh snapshot periodically during index
 validation

Enhances validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every few pages, allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
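Note: the refresh cadence described in the commit message can be sketched
as a small standalone simulation. This is only an illustration under
stated assumptions, not the patch's code: ToyXid, ToyScanResult,
toy_xid_newer and toy_validate_scan are hypothetical names, the "horizon"
is modelled as a counter that advances once per page, and no real
snapshot manager is involved. It mirrors the counter that starts at 1,
the 4096-page interval, and the way limitXmin is kept as the newer of
the xmin values seen so far.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a transaction id; not PostgreSQL's type. */
typedef uint32_t ToyXid;

/* Same interval as the patch's VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE. */
#define TOY_RESET_SNAPSHOT_EACH_N_PAGE 4096

typedef struct ToyScanResult
{
	ToyXid		limit_xmin;		/* newest snapshot xmin used during the scan */
	int			refreshes;		/* how many times the snapshot was replaced */
} ToyScanResult;

/* Return the newer of two xids, using a modulo-2^32 comparison. */
static ToyXid
toy_xid_newer(ToyXid a, ToyXid b)
{
	return ((int32_t) (a - b) > 0) ? a : b;
}

/*
 * Simulate a validation scan over npages pages.  The xmin that a snapshot
 * taken "now" would get starts at start_xmin and advances by step_per_page
 * after every page (an assumption made purely for illustration).
 */
static ToyScanResult
toy_validate_scan(int npages, ToyXid start_xmin, ToyXid step_per_page)
{
	ToyXid		horizon = start_xmin;
	ToyXid		snapshot_xmin = horizon;	/* the first snapshot */
	ToyScanResult res = {snapshot_xmin, 0};
	int			page_read_counter = 1;	/* 1: no refresh right at the start */

	for (int page = 0; page < npages; page++)
	{
		page_read_counter++;
		horizon += step_per_page;

		/* ... tuples on this page would be checked against the snapshot ... */

		if (page_read_counter % TOY_RESET_SNAPSHOT_EACH_N_PAGE == 0)
		{
			/* Drop the old snapshot and take a fresh one. */
			snapshot_xmin = horizon;
			res.limit_xmin = toy_xid_newer(res.limit_xmin, snapshot_xmin);
			res.refreshes++;
		}
	}
	return res;
}
```

In this toy model the value returned in limit_xmin plays the role of the
limitXmin that the caller later uses to wait out older snapshots; between
refreshes nothing pins the older xmin, which is the effect the patch is
after.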
 doc/src/sgml/ref/create_index.sgml            | 11 ++-
 doc/src/sgml/ref/reindex.sgml                 | 11 ++-
 src/backend/access/heap/README.HOT            |  4 +-
 src/backend/access/heap/heapam_handler.c      | 77 ++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c           |  2 +-
 src/backend/access/spgist/spgvacuum.c         | 12 ++-
 src/backend/catalog/index.c                   | 42 +++++++---
 src/backend/commands/indexcmds.c              | 50 ++----------
 src/include/access/tableam.h                  |  7 +-
 src/include/access/transam.h                  | 15 ++++
 src/include/catalog/index.h                   |  2 +-
 .../expected/cic_reset_snapshots.out          | 28 +++++++
 .../sql/cic_reset_snapshots.sql               |  1 +
 13 files changed, 179 insertions(+), 83 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 298a093f554..6220a80474f 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index d62791ff9c3..60f4d0d680f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -502,10 +502,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 633bc245e28..4456b16df70 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,25 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid", NULL);
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d186ce9ec37..8d755470e8c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2678f7ab782..968a8f7725c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d1b96703bbc..e707b012f41 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3534,8 +3534,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3548,7 +3549,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3569,13 +3570,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3625,8 +3627,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3662,6 +3668,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3671,6 +3680,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3690,19 +3702,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3725,6 +3742,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 354ce8dd463..f58e138eed2 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -592,7 +592,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1794,32 +1793,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1841,8 +1819,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4348,7 +4326,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4363,13 +4340,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4381,16 +4351,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4403,7 +4365,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6c43f47814d..d38a6961035 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,10 +708,9 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void 		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 												Relation index_rel,
 												struct IndexInfo *index_info,
-												Snapshot snapshot,
 												struct ValidateIndexState *state,
 												struct ValidateIndexState *aux_state);
 
@@ -1825,18 +1824,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
 						  struct ValidateIndexState *state,
 						  struct ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 53b2b13efc3..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -154,7 +154,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
-- 
2.43.0
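
Note on the transam.h hunk above: the new TransactionIdNewer() helper returns
the later of two XIDs and treats an invalid XID as "no value". The standalone
sketch below models that behavior outside the PostgreSQL tree. The names
xid_is_valid, xid_follows, xid_newer and xid_newer_selfcheck are hypothetical,
and the comparison is simplified to the modulo-2^32 case for normal XIDs; the
real TransactionIdFollows() also special-cases permanent XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL definitions (assumption). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

static inline bool
xid_is_valid(TransactionId xid)
{
    return xid != InvalidTransactionId;
}

/*
 * True if a is logically later than b.  XIDs wrap around, so the
 * difference is interpreted as a signed 32-bit value.  Normal XIDs only.
 */
static inline bool
xid_follows(TransactionId a, TransactionId b)
{
    int32_t diff = (int32_t) (a - b);

    return diff > 0;
}

/* Return the newer of two XIDs; an invalid XID never wins. */
static inline TransactionId
xid_newer(TransactionId a, TransactionId b)
{
    if (!xid_is_valid(a))
        return b;
    if (!xid_is_valid(b))
        return a;
    return xid_follows(a, b) ? a : b;
}

/* Small usage example covering the invalid and wraparound cases. */
static inline void
xid_newer_selfcheck(void)
{
    assert(xid_newer(InvalidTransactionId, 100) == 100);
    assert(xid_newer(100, InvalidTransactionId) == 100);
    assert(xid_newer(100, 200) == 200);
    /* 10 is "after" 4294967290 once the counter has wrapped around */
    assert(xid_newer(4294967290u, 10) == 10);
}
```

The invalid-XID handling matters if validate_index() accumulates limitXmin
across several snapshots starting from InvalidTransactionId; that usage is an
assumption here, since the accumulating code is not part of this excerpt.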



  [application/octet-stream] v20-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.3K, 8-v20-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 4058b4ad87706a184fdae7b1c0d6eb43b267ea7f Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v20 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization allowed concurrent index builds to skip waiting for other backends that were concurrently building indexes without expressions or predicates. With the new snapshot handling approach that periodically refreshes snapshots, this optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- removing special case handling in GetCurrentVirtualXIDs()
- removing related test cases and injection points
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql
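
Note for reviewers: the effect of dropping PROC_IN_SAFE_IC from the exclusion
mask passed to GetCurrentVirtualXIDs() can be illustrated with a toy model.
The sketch below is not PostgreSQL code. ModelProc and
count_procs_to_wait_for are hypothetical names, and XIDs are compared with a
plain "<=" that ignores wraparound. Only the flag values are taken from
proc.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROC_IS_AUTOVACUUM  0x01
#define PROC_IN_VACUUM      0x02
#define PROC_IN_SAFE_IC     0x04    /* removed by this patch */

/* Hypothetical, heavily reduced stand-in for a PGPROC entry. */
typedef struct ModelProc
{
    uint8_t     statusFlags;
    uint32_t    xmin;           /* 0 means "no snapshot advertised" */
} ModelProc;

/*
 * Count the backends a WaitForOlderSnapshots()-style loop would wait for:
 * those not carrying any excluded flag whose advertised xmin is at or
 * before limit_xmin.
 */
static inline size_t
count_procs_to_wait_for(const ModelProc *procs, size_t nprocs,
                        uint32_t limit_xmin, uint8_t exclude_flags)
{
    size_t      count = 0;

    for (size_t i = 0; i < nprocs; i++)
    {
        if (procs[i].statusFlags & exclude_flags)
            continue;           /* backend is explicitly ignored */
        if (procs[i].xmin == 0)
            continue;           /* holds no snapshot */
        if (procs[i].xmin <= limit_xmin)
            count++;            /* snapshot may predate ours */
    }
    return count;
}
```

With the old mask (PROC_IS_AUTOVACUUM | PROC_IN_VACUUM | PROC_IN_SAFE_IC) a
backend flagged PROC_IN_SAFE_IC is skipped. With the new mask it is waited
for like any other backend holding an old snapshot, so the patch relies on
concurrent builds no longer holding such snapshots for long.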

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 423424e51a2..93ad3f3f632 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 629f6d5f2c0..df79b5850f9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8d755470e8c..00c86bfcfc6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1910,11 +1910,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f58e138eed2..2f066f45c62 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1671,10 +1660,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1729,9 +1714,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1761,10 +1743,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1790,9 +1768,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1809,9 +1785,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1852,10 +1825,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1876,10 +1845,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3620,7 +3585,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3994,17 +3958,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4070,7 +4023,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4171,11 +4123,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4207,10 +4154,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4219,11 +4162,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4248,10 +4186,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4271,11 +4205,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4297,10 +4226,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4336,10 +4261,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4367,9 +4288,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4391,13 +4309,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4453,12 +4364,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4522,12 +4427,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4795,36 +4694,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5e07466c737 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0
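
The effect of removing PROC_IN_SAFE_IC on the two masks can be checked with a small standalone sketch. The flag values are copied from the hunk above; the two helper functions are hypothetical names introduced only for illustration and are not part of PostgreSQL.

```c
#include <stdint.h>

/* Values copied from the hunk above; PROC_IN_SAFE_IC (0x04) is no longer defined. */
#define PROC_IN_VACUUM				0x02
#define PROC_VACUUM_FOR_WRAPAROUND	0x08

#define PROC_VACUUM_STATE_MASK	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
#define PROC_XMIN_FLAGS			(PROC_IN_VACUUM)

/* Hypothetical helper: clear the flags that are reset at end of transaction. */
static uint8_t
flags_after_eoxact(uint8_t status_flags)
{
	return status_flags & (uint8_t) ~PROC_VACUUM_STATE_MASK;
}

/* Hypothetical helper: is any xmin-affecting flag set for this backend? */
static int
has_xmin_flag(uint8_t status_flags)
{
	return (status_flags & PROC_XMIN_FLAGS) != 0;
}
```

With these definitions the old 0x04 bit is neither cleared at end of transaction nor treated as an xmin-related flag, which is what the two changed macros express.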



  [application/octet-stream] v20-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (28.3K, 9-v20-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 6fbab3cb700228df664a593dd973d90872c788be Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v20 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
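
The dependency tracking described above can be pictured with a toy model. This is only a sketch: ToyDepend and toy_get_auxiliary_index are hypothetical names, the rows are a flat array rather than a pg_depend catalog scan, and the relkind of the dependent object is stored in the row instead of being looked up. It assumes, as the patch does, that an AUTO dependency from an index onto the main index identifies the auxiliary index.

```c
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Simplified stand-ins for pg_depend.deptype and pg_class.relkind. */
enum { DEP_NORMAL = 'n', DEP_AUTO = 'a' };
enum { KIND_TABLE = 'r', KIND_INDEX = 'i' };

/* One toy dependency row: "objid depends on refobjid". */
typedef struct ToyDepend
{
	Oid		objid;		/* dependent object */
	char	objkind;	/* relkind of the dependent object */
	Oid		refobjid;	/* referenced object */
	char	deptype;	/* kind of dependency */
} ToyDepend;

/*
 * Return the first index that has an AUTO dependency on indexId,
 * or InvalidOid if there is none.
 */
static Oid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, Oid indexId)
{
	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == indexId &&
			deps[i].deptype == DEP_AUTO &&
			deps[i].objkind == KIND_INDEX)
			return deps[i].objid;
	}
	return InvalidOid;
}
```

The same lookup result is what lets DROP INDEX and REINDEX find a junk auxiliary index left behind by a failed concurrent build.
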
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 358 insertions(+), 36 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e7a7a160742..298a093f554 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 4ed3c969012..d62791ff9c3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6c09c6a2b67..bf0bb79474b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -688,6 +688,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -734,6 +736,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -776,6 +779,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1177,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1459,6 +1473,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1609,6 +1624,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3842,6 +3858,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3898,6 +3915,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of that index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4186,7 +4216,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4275,13 +4306,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4307,18 +4355,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from an object
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 65fa7fd74e0..354ce8dd463 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1260,7 +1260,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3639,6 +3639,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3988,6 +3989,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3995,6 +3997,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4068,12 +4071,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4083,6 +4091,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4104,10 +4113,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4288,7 +4305,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4311,6 +4329,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk auxiliary index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4529,6 +4550,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4580,6 +4603,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 54ad38247aa..a1043c183f0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop has to run as the first transaction. But in
+		 * case we have a junk auxiliary index, we want to drop it too (and also
+		 * in a concurrent way). In this case perform silent internal deletion
+		 * of the auxiliary index, and restore transaction state. It is fine to do
+		 * it in the loop because there is only a single element in drop->objects.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4713f18e68d..53b2b13efc3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ca74844b5c6..aca6ec57ad7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 2cff1ac29be..e1464eaa67c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index is dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index is dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v20-0010-Optimize-auxiliary-index-handling.patch (2.4K, 10-v20-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From efd01b195da4b23dc1dc76c44f4f671a8427936b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v20 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- in the index-insert path, detect auxiliary indexes and bypass Datum value computation
- set indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)
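
The two short-circuits described in the commit message can be illustrated
outside the backend. What follows is a minimal standalone sketch, not the
patch code: SketchIndexInfo, sketch_form_index_datum,
sketch_expensive_unchanged_check and sketch_index_unchanged_hint are
hypothetical stand-ins for IndexInfo, FormIndexDatum,
index_unchanged_by_update and the indexUnchanged computation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few IndexInfo fields that matter here. */
typedef struct SketchIndexInfo
{
	int			natts;		/* number of index attributes */
	bool		auxiliary;	/* true for an auxiliary index */
} SketchIndexInfo;

/* Counts calls to the (assumed expensive) unchanged check below. */
int			sketch_check_calls = 0;

/*
 * Fill values[]/isnull[] for one row.  For an auxiliary index nothing is
 * computed and every attribute is reported as NULL.  Returns the number of
 * attribute values actually computed.
 */
int
sketch_form_index_datum(const SketchIndexInfo *info, const long *row,
						long *values, bool *isnull)
{
	if (info->auxiliary)
	{
		for (int i = 0; i < info->natts; i++)
		{
			values[i] = 0;
			isnull[i] = true;
		}
		return 0;
	}

	for (int i = 0; i < info->natts; i++)
	{
		values[i] = row[i];
		isnull[i] = false;
	}
	return info->natts;
}

/* Hypothetical stand-in for the per-index "unchanged by update" check. */
bool
sketch_expensive_unchanged_check(void)
{
	sketch_check_calls++;
	return true;
}

/*
 * The hint is only worth computing for an UPDATE on a regular index.  The
 * && chain short-circuits, so the check never runs for an auxiliary index.
 */
bool
sketch_index_unchanged_hint(const SketchIndexInfo *info, bool is_update)
{
	return is_update && !info->auxiliary &&
		sketch_expensive_unchanged_check();
}
```

In the sketch the auxiliary case costs one loop over the attributes and no
call to the check function, which is the saving the commit message describes.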

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bf0bb79474b..d1b96703bbc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2932,6 +2932,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 499cba145dd..c8b51e2725c 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v20-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.4K, 11-v20-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From 411431eba585d5502c0c9d16376e41a2258590be Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v20 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now reset snapshots periodically during concurrent unique index builds, while still maintaining uniqueness by:
- ignoring SnapshotSelf-dead tuples during uniqueness checks in tuplesort (a fail-fast mechanism, not a guarantee)
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values as a guarantee of correctness

Tuples are SnapshotSelf-tested only in the case of equal index key values; otherwise _bt_load works as before.
---
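
The _bt_load check described above can be sketched standalone. This is a
simplified illustration under stated assumptions, not the patch code:
SketchTuple and find_alive_duplicate are hypothetical names, keys are plain
integers, and the "alive" flag stands in for the result of a SnapshotSelf
heap fetch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple plus its heap liveness. */
typedef struct SketchTuple
{
	int			key;		/* index key value */
	bool		alive;		/* what a SnapshotSelf-style probe would say */
} SketchTuple;

/*
 * Walk tuples sorted by key.  Return the position of the first tuple that
 * is the second alive member of an equal-key group, or -1 if none exists.
 * *probes counts liveness lookups; tuples whose key is unique among their
 * neighbours are never probed.
 */
int
find_alive_duplicate(const SketchTuple *tuples, size_t ntuples, int *probes)
{
	bool		seen_alive = false;
	bool		prev_tested = false;

	*probes = 0;
	for (size_t i = 1; i < ntuples; i++)
	{
		if (tuples[i].key != tuples[i - 1].key)
		{
			/* new key group: forget what was learned about the old one */
			seen_alive = false;
			prev_tested = false;
			continue;
		}

		/* probe the previous tuple lazily, once an equal key shows up */
		if (!prev_tested)
		{
			seen_alive = tuples[i - 1].alive;
			(*probes)++;
			prev_tested = true;
		}

		/* probe the current tuple */
		(*probes)++;
		if (seen_alive && tuples[i].alive)
			return (int) i;
		seen_alive |= tuples[i].alive;
	}
	return -1;
}
```

For the "addda" case (alive, dead, dead, dead, alive with one key) the sketch
reports the last tuple, which is the situation the tuplesort-level check alone
may miss.
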
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 264 insertions(+), 94 deletions(-)
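
The new hasnulls output argument and its interaction with nulls_not_distinct
can be sketched as follows. This is an assumption-laden illustration, not the
nbtree code: SketchAttr, sketch_keep_natts and sketch_keys_conflict are
hypothetical, attributes are plain integers, and the "result greater than the
key count means all attributes matched" convention is taken from how the
patch compares the return value with keysz.

```c
#include <stdbool.h>

/* Hypothetical nullable attribute; not a PostgreSQL type. */
typedef struct SketchAttr
{
	bool		isnull;
	int			value;
} SketchAttr;

/*
 * Return one plus the number of leading attributes that compare equal, so
 * a result greater than natts means every attribute matched.  Two NULLs
 * count as equal here.  If has_nulls is not NULL, true is OR-ed into it
 * whenever a NULL is seen; the caller must initialize it to false.
 */
int
sketch_keep_natts(const SketchAttr *a, const SketchAttr *b, int natts,
				  bool *has_nulls)
{
	int			keepnatts = 1;

	for (int i = 0; i < natts; i++)
	{
		if (has_nulls)
			*has_nulls |= (a[i].isnull || b[i].isnull);

		if (a[i].isnull != b[i].isnull)
			break;
		if (!a[i].isnull && a[i].value != b[i].value)
			break;
		keepnatts++;
	}
	return keepnatts;
}

/*
 * Two keys conflict for uniqueness purposes if all attributes match,
 * unless a NULL is involved and NULLs are treated as distinct.
 */
bool
sketch_keys_conflict(const SketchAttr *a, const SketchAttr *b, int natts,
					 bool nulls_not_distinct)
{
	bool		has_nulls = false;
	bool		equal = sketch_keep_natts(a, b, natts, &has_nulls) > natts;

	if (has_nulls && !nulls_not_distinct)
		equal = false;
	return equal;
}
```

Passing NULL for has_nulls leaves the comparison result unchanged, which is
how the existing callers in the diff below use the extended signature.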

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7273b1aee00..0eaa4df5582 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 08884116aec..347b50d6e51 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2f45ae96c0c..d186ce9ec37 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of a concurrent build.
+	 * This is required because of the periodic snapshot resets.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples need not be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool contains no two
+	 * alive tuples with the same values. That can happen when alive tuples
+	 * are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* test the previous tuple if not done yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the one in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1320,7 +1432,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1417,7 +1529,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,21 +1546,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1457,16 +1559,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1536,6 +1638,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1550,7 +1653,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1630,7 +1733,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1641,7 +1744,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1744,6 +1847,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1847,11 +1951,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1931,6 +2036,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1953,14 +2059,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7d3..d07fe72713d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -66,8 +66,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2532,7 +2530,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3852,7 +3850,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3970,17 +3968,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when a null column is
+ * seen in either tuple.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4006,6 +4011,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4025,7 +4032,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4036,7 +4043,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent unique index build.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4045,6 +4053,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is non-NULL, it is set to true if a NULL is seen in either tuple.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4053,7 +4063,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4070,6 +4081,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6432ef55cdc..cca1dbb8e37 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a93d4f388bc..15206d27227 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,8 +1694,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 5f70e8dddac..71a5c21e0df 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -358,6 +360,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -400,6 +403,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1653,6 +1657,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1662,18 +1667,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check, see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..38471e90a0c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1339,8 +1339,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a69f71a3ace..acd20dbfab8 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of a concurrent index build, SO_RESET_SNAPSHOT is applied to the
+ * scan, so snapshots are replaced on the fly to let the xmin horizon advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
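
As a rough illustration of the hasnulls out-parameter that the patch above adds to _bt_keep_natts and _bt_keep_natts_fast, here is a minimal standalone sketch. ToyTuple, TOY_MAX_KEYS and toy_keep_natts are hypothetical names invented for this example, and plain int comparison stands in for the insertion scankey's comparator, so this only mirrors the shape of the loop and is not the nbtree code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_KEYS 4			/* hypothetical limit for this sketch */

/* Hypothetical stand-in for an index tuple: int keys with NULL flags. */
typedef struct ToyTuple
{
	int			values[TOY_MAX_KEYS];
	bool		isnull[TOY_MAX_KEYS];
} ToyTuple;

/*
 * Return 1 + the number of leading key attributes that are equal in both
 * tuples.  A result of nkeyatts + 1 means all key attributes are equal.
 * If hasnulls is not NULL, OR into it whether any examined attribute was
 * NULL in either tuple; the caller is expected to initialize it to false.
 */
int
toy_keep_natts(const ToyTuple *lastleft, const ToyTuple *firstright,
			   int nkeyatts, bool *hasnulls)
{
	int			keepnatts = 1;

	assert(nkeyatts >= 1 && nkeyatts <= TOY_MAX_KEYS);

	for (int att = 0; att < nkeyatts; att++)
	{
		bool		isnull1 = lastleft->isnull[att];
		bool		isnull2 = firstright->isnull[att];

		if (hasnulls != NULL)
			*hasnulls |= (isnull1 || isnull2);

		/* NULL versus non-NULL always counts as different */
		if (isnull1 != isnull2)
			break;
		/* two non-NULL values: compare them */
		if (!isnull1 && lastleft->values[att] != firstright->values[att])
			break;

		keepnatts++;
	}

	return keepnatts;
}
```

A caller sets the flag to false before the call and reads it afterwards. Passing NULL skips the bookkeeping, as the Assert in _bt_keep_natts does when it calls _bt_keep_natts_fast(rel, lastleft, firstright, NULL).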



  [application/octet-stream] v20-0007-Add-Datum-storage-support-to-tuplestore.patch (17.3K, 12-v20-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From a5b799a999f8cc7dbc934454f0d47dff14c7fda6 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v20 07/12] Add Datum storage support to tuplestore

 Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This support enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..12ae705c091 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* typbyval of that type */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape. Used to provide the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, the first "unsigned int" of a stored tuple is the total
+ * size on-tape of the tuple, including itself (so it is never zero).
+ * The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both "unsigned int" words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-length Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
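
The on-tape layout described in the commit message above (fixed-length values stored as raw bytes, variable-length values preceded by a length word) can be sketched outside PostgreSQL as follows. ToyTape, toy_write, toy_len and toy_read are hypothetical names for this example only; an in-memory buffer stands in for the BufFile, the length word here counts only the data bytes, and the trailing length word used for backward scans is left out. Checking for end of tape before returning a fixed length is a choice made in this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape" standing in for a BufFile. */
typedef struct ToyTape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} ToyTape;

/*
 * Append one value.  typlen > 0 means fixed-length: the bytes are written
 * raw.  typlen < 0 means variable-length: an unsigned int holding the data
 * length (varlen) is written first.
 */
void
toy_write(ToyTape *tape, const void *data, int typlen, unsigned int varlen)
{
	unsigned int len = (typlen > 0) ? (unsigned int) typlen : varlen;

	if (typlen < 0)
	{
		assert(tape->wpos + sizeof(len) <= sizeof(tape->buf));
		memcpy(tape->buf + tape->wpos, &len, sizeof(len));
		tape->wpos += sizeof(len);
	}

	assert(tape->wpos + len <= sizeof(tape->buf));
	memcpy(tape->buf + tape->wpos, data, len);
	tape->wpos += len;
}

/* Length of the next value, or 0 at end of tape. */
unsigned int
toy_len(ToyTape *tape, int typlen)
{
	unsigned int len;

	if (tape->rpos >= tape->wpos)
		return 0;
	if (typlen > 0)
		return (unsigned int) typlen;	/* constant, nothing to read */

	memcpy(&len, tape->buf + tape->rpos, sizeof(len));
	tape->rpos += sizeof(len);
	return len;
}

/* Copy the next len data bytes into out and advance the read position. */
void
toy_read(ToyTape *tape, void *out, unsigned int len)
{
	assert(tape->rpos + len <= tape->wpos);
	memcpy(out, tape->buf + tape->rpos, len);
	tape->rpos += len;
}
```

With a 6-byte fixed-length type (the typlen of PostgreSQL's tid type), each stored value costs exactly 6 bytes on tape, which is the saving the commit message is after for TID storage. A variable-length value costs its data length plus sizeof(unsigned int).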



  [application/octet-stream] v20-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.9K, 13-v20-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From b81c096c3aafefb4591eefbf5e60d378050ec309 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v20 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- STIR (Short-Term Index Replacement) access method: intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index a6dad54ff58..ca5214461e6 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad09..232c87ec267 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3092,6 +3092,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3143,6 +3144,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		if (metaData->skipInserts)	/* inserts disabled: clean up and exit */
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building of STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, disable further inserts and warn that the index should be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cca1dbb8e37..e9e22ec0e84 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..38602e6a72d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index dfbb4c85460..a121b4d31c9 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62beb71da28..f05a5eecdda 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2492282213f..0341bb74325 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -181,12 +181,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -215,6 +216,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index cf48ae6d0c2..52dde57680d 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5137,7 +5137,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5151,7 +5152,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5176,9 +5178,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5187,12 +5189,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5201,7 +5204,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v20-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (96.8K, 14-v20-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 41f6ddbb909c6ac2fc408030805f0312d474b709 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v20 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index into the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both indexes, and the TIDs of each are put into a tuplesort
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index
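
The two-pointer step above can be illustrated with a minimal standalone sketch. It is not the patch's code: plain sorted arrays of int64 stand in for the two tuplesorts, an output array stands in for the tuplestore, and the name sorted_difference is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: sorted arrays replace the tuplesorts and "out"
 * replaces the tuplestore.  sorted_difference() is a hypothetical name,
 * not a function from the patch.
 *
 * Writes every value present in aux[] but absent from main_tids[] to
 * out[] (which must have room for naux entries) and returns how many
 * values were written.  Both inputs must be sorted in ascending order.
 */
size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
                  const int64_t *aux, size_t naux,
                  int64_t *out)
{
    size_t i = 0;       /* cursor into main_tids */
    size_t nout = 0;    /* number of values written to out */

    assert(out != NULL || naux == 0);

    for (size_t j = 0; j < naux; j++)
    {
        /* Advance the main cursor until it reaches or passes aux[j]. */
        while (i < nmain && main_tids[i] < aux[j])
            i++;

        /* Main is exhausted or has passed aux[j]: the value is missing. */
        if (i == nmain || main_tids[i] > aux[j])
            out[nout++] = aux[j];
    }

    return nout;
}
```

Each input is read once, so the cost is linear in the combined number of TIDs once both sides are sorted.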

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 292 +++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/catalog/toasting.c             |   3 +-
 src/backend/commands/indexcmds.c           | 337 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  28 +-
 src/include/catalog/index.h                |  12 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/execnodes.h              |   4 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1104 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4265a22d4de..8ccd69b14c2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6314,6 +6314,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6354,13 +6366,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6377,8 +6388,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 147a8f7587c..e7a7a160742 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c4055397146..4ed3c969012 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones that are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0eaa4df5582..633bc245e28 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		* Attempt to fetch the next TID from the auxiliary sort. If it's
+		* empty, we set auxindexcursor to NULL.
+		*/
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		* If the auxiliary sort is not yet empty, we now try to synchronize
+		* the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		* the main sort cursor until we've reached or passed the auxiliary TID.
+		*/
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number =  ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end both tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e9e22ec0e84..6c09c6a2b67 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most cases
+ *		it should be equal to the persistence level of the table; auxiliary
+ *		indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -744,7 +749,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -755,11 +761,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +798,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1408,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1463,7 +1474,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1473,6 +1485,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux index are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2630,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3469,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that does no
+ * actual table scan, so the result is an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3493,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but those tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3515,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes and do a "merge join" of the
+ * TID lists to see which tuples from the auxiliary index are missing from the
+ * target index.  Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index.  Note that the bulkdelete calls must
+ * be made in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3540,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Finally, the steps needed to drop the auxiliary index concurrently are done.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3592,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3617,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3663,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are ended by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3695,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3756,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4032,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4281,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4307,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 15efb02badb..edd61c294a6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1288,16 +1288,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 15206d27227..65fa7fd74e0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,6 +588,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -833,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1251,7 +1267,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1593,6 +1610,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of concurrent build - create auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1621,11 +1648,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1635,7 +1662,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1674,7 +1701,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1686,14 +1713,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but indexes are still
+	 * marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index. The index will be created empty,
+	 * without any actual heap scan, but marked as "ready-for-inserts". The goal
+	 * of that index is to accumulate new tuples while the main index is built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1722,9 +1773,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1742,24 +1812,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1786,7 +1846,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1811,6 +1871,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3531,6 +3638,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3636,8 +3744,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3689,8 +3804,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3751,6 +3873,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3854,15 +3983,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3913,6 +4045,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3926,12 +4063,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3940,6 +4082,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3958,10 +4101,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4042,13 +4189,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4091,6 +4281,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts". So,
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4098,12 +4323,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4141,7 +4360,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4170,7 +4389,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4260,14 +4479,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4292,6 +4511,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4305,11 +4546,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4329,6 +4570,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index acd20dbfab8..6c43f47814d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,12 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	void		(*index_validate_scan) (Relation table_rel,
+										Relation index_rel,
+										struct IndexInfo *index_info,
+										Snapshot snapshot,
+										struct ValidateIndexState *state,
+										struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1820,19 +1821,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates in
+ * both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
 						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	table_rel->rd_tableam->index_validate_scan(table_rel,
+											   index_rel,
+											   index_info,
+											   snapshot,
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..4713f18e68d 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0341bb74325..e02fc6aa3e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -186,8 +186,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..ca74844b5c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..ed6c20a495c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,14 +2041,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e21ff426519..2cff1ac29be 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v20-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.1K, 15-v20-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From 7a9042056dec25923c166bee36b72e1b3573c5d7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v20 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to mitigate this by allowing VACUUM to ignore such snapshots. However, it was reverted in commit e28bb8851969 because indexes could miss heap tuples that were HOT-updated and HOT-pruned during index creation, leading to index corruption.

This patch introduces an alternative by periodically resetting the snapshot used during the first phase. By resetting the snapshot every N pages during the heap scan, it allows the xmin horizon to advance.

Currently, this technique is applied only to:

- the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in a following commit
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in a following commit

A new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset "between" every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
---
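Note on the mechanism described in the commit message above: the following is a minimal, self-contained sketch of the "reset every N pages" rule, not the patch's implementation. ToyScan, toy_scan_init, toy_fetch_block and toy_scan_relation are hypothetical names invented for illustration; reset_snapshot stands in for the SO_RESET_SNAPSHOT flag and reset_every for SO_RESET_SNAPSHOT_EACH_N_PAGE (whose real value is not shown in this part of the thread). Taking a fresh snapshot is modelled by bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a heap scan; all names here are hypothetical. */
typedef struct ToyScan
{
	bool		reset_snapshot;	/* models the SO_RESET_SNAPSHOT flag */
	uint32_t	reset_every;	/* models SO_RESET_SNAPSHOT_EACH_N_PAGE */
	uint32_t	snapshot_generation;	/* bumped on each modelled reset */
} ToyScan;

ToyScan
toy_scan_init(bool reset_snapshot, uint32_t reset_every)
{
	ToyScan		scan;

	assert(reset_every > 0);	/* avoid modulo by zero below */
	scan.reset_snapshot = reset_snapshot;
	scan.reset_every = reset_every;
	scan.snapshot_generation = 0;
	return scan;
}

/*
 * Models the check added to heap_fetch_next_buffer(): when the flag is set
 * and the block number is a multiple of N, drop the old snapshot and take
 * a new one (here: just count it).
 */
void
toy_fetch_block(ToyScan *scan, uint32_t blkno)
{
	if (scan->reset_snapshot && blkno % scan->reset_every == 0)
		scan->snapshot_generation++;
}

/* Scan blocks 0..nblocks-1 in order; return how many resets happened. */
uint32_t
toy_scan_relation(ToyScan *scan, uint32_t nblocks)
{
	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
		toy_fetch_block(scan, blkno);
	return scan->snapshot_generation;
}
```

With reset_every = 4 and a 10-block relation, the modelled reset fires at blocks 0, 4 and 8. Because the check is a modulo on the block number, block 0 also triggers a reset. The sketch assumes the scan starts at block 0 and proceeds in order.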
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3048e044aec..e59197bb35e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 0d9c2b0b653..a6dad54ff58 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f856..e5a945a1b14 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a65acd89104..4cea1612ce6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ec8cda1c68..10316246e4d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -612,6 +613,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible due to xid assignment. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -653,7 +684,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1304,6 +1340,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..8a584db595a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * For a unique index we need a consistent snapshot for the whole scan.
+	 * A parallel scan requires some additional infrastructure to perform
+	 * the scan with SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration
+			 * is not allowed because the snapshot is replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* use the active snapshot, since PushActiveSnapshot() may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924ad..f3986d086b6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1409,6 +1419,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1434,9 +1445,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1490,6 +1508,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1584,6 +1604,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1600,6 +1622,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..cbd0ba9aa01 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to propagate the xmin horizon forward.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0f75debe7f1..a93d4f388bc 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,23 +1694,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible in one snapshot or
+	 * in several refreshed snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4073,9 +4067,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4090,7 +4081,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ff65867eebe..0d5e54e0cc2 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6899,6 +6900,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6954,6 +6956,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7011,6 +7018,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..6caad42ea4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..8df6ba9b89e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped,
+	 * the catalog snapshot invalidated, and the latest snapshot pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon propagating forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * In the case of a non-unique index and a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. That leads to changing snapshots
+ * on the fly to allow the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
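
For readers following the SO_RESET_SNAPSHOT comments in the patch above, here is a minimal toy model of the periodic reset idea. It is only a sketch under stated assumptions: `ToyBackend`, `toy_push_snapshot`, `toy_pop_snapshot` and `toy_scan` are hypothetical names invented for illustration and are not PostgreSQL APIs, and the real logic lives in heap_reset_scan_snapshot, which is not shown in this part of the diff. The model assumes one concurrent transaction per scanned page purely to make the effect visible.

```c
#include <assert.h>

/* Toy model only: none of these names exist in PostgreSQL. */
#define TOY_RESET_EACH_N_PAGES 4096	/* mirrors SO_RESET_SNAPSHOT_EACH_N_PAGE */

typedef struct ToyBackend
{
	unsigned long next_xid;		/* stand-in for the global xid counter */
	unsigned long xmin;			/* 0 means "no snapshot held" */
	int			resets;			/* number of snapshot resets performed */
} ToyBackend;

/* Taking a snapshot pins xmin at the current xid counter. */
static void
toy_push_snapshot(ToyBackend *b)
{
	b->xmin = b->next_xid;
}

/* Dropping the snapshot releases xmin. */
static void
toy_pop_snapshot(ToyBackend *b)
{
	assert(b->xmin != 0);
	b->xmin = 0;
}

/*
 * Scan npages pages, replacing the snapshot every TOY_RESET_EACH_N_PAGES
 * pages.  As in the patch's comment, the last snapshot is left in place
 * at the end of the scan.  Returns the number of resets.
 */
static int
toy_scan(ToyBackend *b, unsigned long npages)
{
	unsigned long page;

	toy_push_snapshot(b);
	for (page = 0; page < npages; page++)
	{
		b->next_xid++;			/* assumed concurrent activity */
		if ((page + 1) % TOY_RESET_EACH_N_PAGES == 0)
		{
			toy_pop_snapshot(b);
			toy_push_snapshot(b);
			b->resets++;
		}
	}
	return b->resets;
}
```

In this model, a backend starting at xid 100 and scanning 10000 pages resets twice and ends with xmin 8292 rather than 100, which is the "xmin horizon moves forward" effect the comments describe. Note that the toy ignores the catalog snapshot and the registered-snapshot restriction asserted in table_beginscan_strat.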



  [application/octet-stream] v20-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 16-v20-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From d85b235e8917062dd2d62a008003b89ed035917e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v20 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying that technique to parallel builds was the requirement to wait until worker processes restore their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with scan
- relax the restriction on parallel workers calling GetLatestSnapshot
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e5a945a1b14..423424e51a2 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 4cea1612ce6..629f6d5f2c0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8a584db595a..7273b1aee00 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that, while that snapshot is
+	 * reset every so often (in the case of a non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f3986d086b6..2f45ae96c0c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1420,6 +1417,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1437,12 +1435,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1450,6 +1457,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1510,7 +1522,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1537,7 +1549,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1613,6 +1626,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent non-unique build, snapshots are reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1621,7 +1641,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1645,7 +1666,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1895,6 +1916,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1949,11 +1971,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1989,4 +2015,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..6f04c365994 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly and use the active snapshot as the initial state
+	 * rather than restoring a serialized one.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate dynamic shared memory for per-worker flags that report
+		 * whether each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cbd0ba9aa01..6432ef55cdc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ed35c58c2c3..8a15dd72b91 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ad440ff024c..f251bc52895 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -342,14 +342,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8df6ba9b89e..a69f71a3ace 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. That causes snapshots to be changed on the fly, which allows
+ * the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v20-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 17-v20-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 62602601260a531754108a9e00eeb863d98b3eac Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v20 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required because of the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v20-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (25.3K, 18-v20-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From fc9f12ce38e1c50b21fb48b244da51eba3072536 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v20 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged in single commit. it
 is required for stability of stress tests.

---
 contrib/amcheck/meson.build                   |   1 +
 .../t/006_cic_bt_index_parent_check.pl        |  39 +++++
 contrib/amcheck/verify_nbtree.c               |  68 ++++-----
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 +++++++++++++--
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++++++++-----
 src/backend/utils/time/snapmgr.c              |   2 +
 9 files changed, 285 insertions(+), 88 deletions(-)
 create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index b33e8c9b062..b040000dd55 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -49,6 +49,7 @@ tests += {
       't/003_cic_2pc.pl',
       't/004_verify_nbtree_unique.pl',
       't/005_pitr.pl',
+      't/006_cic_bt_index_parent_check.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..6e52c5e39ec
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,39 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into the table
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d962fe392cd..0f75debe7f1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1790,6 +1790,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4195,7 +4196,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4274,6 +4275,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..499cba145dd 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 514eae1037d..8851f0fda06 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -486,6 +486,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttNo != attNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -696,6 +738,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -706,23 +750,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for any additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 46d533b7288..566dbecb390 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1178,6 +1179,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 59233b64730..0c720e450e9 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -716,12 +716,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -756,8 +758,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -769,30 +771,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * Indexes are opened in list order to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -815,7 +863,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -835,27 +889,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -875,7 +925,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -883,6 +933,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -920,27 +974,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -948,7 +1010,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..ad440ff024c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -447,6 +448,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0
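
For readers skimming the infer_arbiter_indexes() hunks above, the selection rule they introduce can be modelled outside the planner. The following is only a minimal sketch, not code from the patch: `ToyIndex`, `toy_infer_arbiters` and the bitmask representation of key attributes are hypothetical stand-ins for pg_index and the Bitmapset comparison, and expression and predicate matching are left out entirely. It shows the two points that matter: ready-but-not-yet-valid indexes are accepted as arbiter candidates, and the result is rejected unless at least one candidate is already valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the pg_index fields the patch reads. */
typedef struct ToyIndex
{
	unsigned	oid;
	bool		indisunique;
	bool		indisready;
	bool		indisvalid;
	unsigned	keyattrs;		/* bitmask of indexed attribute numbers */
} ToyIndex;

/*
 * Collect arbiter candidates matching required_attrs into out[] (which must
 * have room for nindexes entries).  Returns the number of arbiters, or 0 if
 * nothing matched or none of the matches is indisvalid yet.
 */
static size_t
toy_infer_arbiters(const ToyIndex *indexes, size_t nindexes,
				   unsigned required_attrs, unsigned *out)
{
	size_t		nfound = 0;
	bool		found_valid = false;

	assert(indexes != NULL && out != NULL);

	for (size_t i = 0; i < nindexes; i++)
	{
		const ToyIndex *idx = &indexes[i];

		/* Not-yet-valid indexes still count, as long as they are ready. */
		if (!idx->indisready)
			continue;
		if (!idx->indisunique)
			continue;
		if (idx->keyattrs != required_attrs)
			continue;

		out[nfound++] = idx->oid;
		found_valid |= idx->indisvalid;
	}

	/* At least one candidate must already be valid at planning time. */
	return found_valid ? nfound : 0;
}
```

In this model, a table with one valid unique index and a ready-but-invalid twin (the REINDEX CONCURRENTLY situation) yields both OIDs as arbiters, while a table holding only the ready-but-invalid index yields none, which corresponds to the "no unique or exclusion constraint matching" error path in the patch.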






Thread overview: 33+ messages
2024-05-07 12:35 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
2024-05-07 20:23 ` Michail Nikolaev <[email protected]>
2024-05-09 13:00   ` Michail Nikolaev <[email protected]>
2024-06-11 08:58     ` Michail Nikolaev <[email protected]>
2024-08-06 23:40       ` Matthias van de Meent <[email protected]>
2024-08-08 13:53         ` Michail Nikolaev <[email protected]>
2024-09-01 21:19           ` Michail Nikolaev <[email protected]>
2024-09-08 15:18             ` Michail Nikolaev <[email protected]>
2024-11-12 15:00               ` Michail Nikolaev <[email protected]>
2024-12-02 01:39                 ` Michail Nikolaev <[email protected]>
2024-12-09 20:53                   ` Michail Nikolaev <[email protected]>
2024-12-17 23:29                     ` Michail Nikolaev <[email protected]>
2024-12-21 18:00                       ` Michail Nikolaev <[email protected]>
2024-12-24 13:06                         ` Michail Nikolaev <[email protected]>
2024-12-24 19:39                           ` Michail Nikolaev <[email protected]>
2024-12-25 05:19                           ` Michael Paquier <[email protected]>
2024-12-25 15:14                             ` Michail Nikolaev <[email protected]>
2025-01-01 16:16                               ` Michail Nikolaev <[email protected]>
2025-01-01 17:53                                 ` Michail Nikolaev <[email protected]>
2025-01-04 01:12                                 ` Matthias van de Meent <[email protected]>
2025-01-06 13:36                                   ` Michail Nikolaev <[email protected]>
2025-01-08 02:12                                     ` Michail Nikolaev <[email protected]>
2025-01-18 14:18                                       ` Michail Nikolaev <[email protected]>
2025-01-30 01:00                                         ` Michail Nikolaev <[email protected]>
2025-02-04 01:38                                           ` Michail Nikolaev <[email protected]>
2025-02-20 14:56                                             ` Mihail Nikalayeu <[email protected]>
2025-03-07 22:58                                               ` Michail Nikolaev <[email protected]>
2025-04-06 23:45                                                 ` Mihail Nikalayeu <[email protected]>
2025-04-30 20:01                                                   ` Mihail Nikalayeu <[email protected]>
2025-05-18 15:09                                                     ` Mihail Nikalayeu <[email protected]>
2025-05-18 15:56                                                       ` Álvaro Herrera <[email protected]>
2025-05-18 16:09                                                         ` Mihail Nikalayeu <[email protected]>
2025-05-23 21:59                                                           ` Mihail Nikalayeu <[email protected]>
