public inbox for [email protected]
help / color / mirror / Atom feedFrom: Mihail Nikalayeu <[email protected]>
To: Matthias van de Meent <[email protected]>
Cc: Antonin Houska <[email protected]>
Cc: Hannu Krosing <[email protected]>
Cc: Sergey Sargsyan <[email protected]>
Cc: Álvaro Herrera <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Michael Paquier <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Andrey Borodin <[email protected]>
Cc: Melanie Plageman <[email protected]>
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Date: Mon, 23 Mar 2026 23:08:00 +0100
Message-ID: <CADzfLwVeQikArGV885zRMiSDW2-y=h=bvQiROvdq1Re1ojx9QA@mail.gmail.com> (raw)
In-Reply-To: <CADzfLwU4oHR9gMV+OHgerp2ZLi3BSfLRZcBsG8g3G0f_RLPrGg@mail.gmail.com>
References: <CADzfLwW9QczZW-E=McxcjUv0e5VMDctQNETbgao0K-SimVhFPA@mail.gmail.com>
<[email protected]>
<CADzfLwXKtriMnfCNVGNH2ahwXaByjo-QOMWiDTU-9WZqh+zQ5g@mail.gmail.com>
<CADzfLwW5bDWSxjHK7mqX8Lewki3+5FBydBC+nVcxg4xMGKscyw@mail.gmail.com>
<CAMAof6-4xaV3QE2ErYJaJhu6qjFn99sWyo_HQeBhHikZM3GexA@mail.gmail.com>
<CADzfLwXocKhpW3eFP1oScz+m+1XJ3bpi9QmVpoqC9RX9oyX=UA@mail.gmail.com>
<CAMAof695VA+mbVRhWCTus=E0WnsMAQyqXxfOTohbcb7VUHSP4g@mail.gmail.com>
<CAMAof69JSL8MYWG2qRScs3RQDpfcyZT_wFwW4SoAvftW+K_p1g@mail.gmail.com>
<CADzfLwVMtwjHh8KY9kP=_vcYPqHs=JDzuexO4RFQ2fM8VoqovA@mail.gmail.com>
<CAMAof68L0GO0F0bwuXtLZAjh9k_Hj+o0-8mqfO6iEQyXr4PuVA@mail.gmail.com>
<CADzfLwUrodAcOggK+3j3LbPLaSXemgHxa-n=LhZTwRAsaakL2g@mail.gmail.com>
<CAMAof691D4O=3QTuPwJXBYxYpG6s3A=tVhL9vN=T3eeRTMnaig@mail.gmail.com>
<CADzfLwVT3Y14g6Maz2y92sP2L7rPvpznt+MHM++xiy-U3XMLZQ@mail.gmail.com>
<CADzfLwXQe9XfQfJs3W-DCPqeqG4rq-6FoYUpGbbpgjcT1Eotpg@mail.gmail.com>
<CAMAof68kNgwWdkhmZd1ysfyU3PF66Wz+UaUr9g-LJg-_0xBV_Q@mail.gmail.com>
<CADzfLwUtLqYrupZp4QQuWwv4W_LgYWBRStybvQ+S0SZiHrp62A@mail.gmail.com>
<CADzfLwVYUBb8cUVQ_1mzVzNMyJH84VZKFCRyATvBZKbLW377CA@mail.gmail.com>
<CADzfLwWbV1i7+cP_Hqr3qgQnBXkAqgrCQxd5PFzqp2AOTK=40w@mail.gmail.com>
<CADzfLwXJc0jdDDS43-Fj0gKmwX-FURS3eY7MyLQ89qDPA6T5Ug@mail.gmail.com>
<CADzfLwVaV15R2rUNZmKqLKweiN3SnUBg=6_qGE_ERb7cdQUD8g@mail.gmail.com>
<CAEze2WgBffcC_SKGLmVxW8uRTEsrwWOHDQujN6zyxy1tSYLJ=Q@mail.gmail.com>
<CADzfLwVon8ESWOkg+8KU0F9=Hg7QKriNVX-hqcm-v-XZmHkzig@mail.gmail.com>
<CAEze2WiXYx1LKr=9d7PLsZOYrGytY9AN__tFFw4p_Ysgm1-e5g@mail.gmail.com>
<CADzfLwUKXcXKZgX+e8ACsOXe_CgtWmNJY_6dyn8EO0AXYOn2pA@mail.gmail.com>
<CAEze2WiiR2PeXg_vaURjjiiwvjQ=Um8wxWi1BcVS0BGyxiD2gQ@mail.gmail.com>
<CAMT0RQQP9JiGqqB+pVBzPT7unG1BMBuLj=kGPk4BeS3g6VyT1A@mail.gmail.com>
<CAMT0RQSbFJCpetFy22=O=gKR2ZfH=tMTQeCM743T4o3rMjaeTQ@mail.gmail.com>
<8010.1764584989@localhost>
<CADzfLwUjgOxk8dd8XLv3jn05gAxxwEZoAA7-2Owb2CczkSb6Tw@mail.gmail.com>
<5778.1764660480@localhost>
<CADzfLwXMzv4_2Mc54qpASJ7FrKwQ6OsG0d9GxiYDxEKZ2KqHSA@mail.gmail.com>
<CAEze2Wg6d2M8hop4uwdQXeH-YkOmEHyqp83+aE7vEzKdmu7w-A@mail.gmail.com>
<CADzfLwVPQOF7m5x4d6=1tRJO==FUZdc3LeO=N_TpmOzii7iH_Q@mail.gmail.com>
<CADzfLwVHzfdD3WB0iNDKJtXQ37VCWrYnCULz3QKzwM7jJFHt-A@mail.gmail.com>
<CADzfLwU4oHR9gMV+OHgerp2ZLi3BSfLRZcBsG8g3G0f_RLPrGg@mail.gmail.com>
Hello!
Fixed compilation, updates stress test, fixed few potential issues
with tuplestore, some style fixes around.
Best regards,
Mikhail.
Attachments:
[application/octet-stream] v31-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch (21.0K, 2-v31-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch)
download | inline diff:
From bd1919d3a299ac927322d2c3d5eee1b273ba43a5 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 12 Jan 2026 00:57:56 +0300
Subject: [PATCH v31 3/7] Add Datum storage support to tuplestore Extend
tuplestore to store individual Datum values: - fixed-length datatypes and
variable-length datatypes: include a length header - by-value types: store
inline with one extra byte (but without support of random access)
This support enables usages of tuplestore for non-tuple data (TIDs) in the next commit.
---
src/backend/utils/sort/tuplestore.c | 361 +++++++++++++++++++++++-----
src/include/utils/tuplestore.h | 33 +--
2 files changed, 321 insertions(+), 73 deletions(-)
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 273a4c9b02f..3fc54deb0fd 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
/*-------------------------------------------------------------------------
*
* tuplestore.c
- * Generalized routines for temporary tuple storage.
+ * Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples. However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
*
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc. It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples. However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written. This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
* Also, it is possible to support multiple independent read pointers.
*
* A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
#include "executor/executor.h"
#include "miscadmin.h"
#include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/resowner.h"
#include "utils/tuplestore.h"
@@ -116,16 +121,15 @@ struct Tuplestorestate
BufFile *myfile; /* underlying file, or NULL if none */
MemoryContext context; /* memory context for holding tuples */
ResourceOwner resowner; /* resowner for holding temp files */
+ Oid datumType; /* InvalidOid or oid of Datum's to be stored */
+ int16 datumTypeLen; /* typelen of that Datum */
+ bool datumTypeByVal; /* by-value of that Datum */
/*
* These function pointers decouple the routines that must know what kind
* of tuple we are handling from the routines that don't need to know it.
* They are set up by the tuplestore_begin_xxx routines.
*
- * (Although tuplestore.c currently only supports heap tuples, I've copied
- * this part of tuplesort.c so that extension to other kinds of objects
- * will be easy if it's ever needed.)
- *
* Function to copy a supplied input tuple into palloc'd space. (NB: we
* assume that a single pfree() is enough to release the tuple later, so
* the representation must be "flat" in one palloc chunk.) state->availMem
@@ -150,6 +154,12 @@ struct Tuplestorestate
*/
void *(*readtup) (Tuplestorestate *state, unsigned int len);
+ /*
+ * Function to get length of tuple from tape. Used to provide 'len' argument
+ * for readtup (see above).
+ */
+ unsigned int(*lentup) (Tuplestorestate *state, bool eofOK);
+
/*
* This array holds pointers to tuples in memory if we are in state INMEM.
* In states WRITEFILE and READFILE it's not used.
@@ -186,6 +196,7 @@ struct Tuplestorestate
#define COPYTUP(state,tup) ((*(state)->copytup) (state, tup))
#define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
#define READTUP(state,len) ((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK) ((*(state)->lentup) (state, eofOK))
#define LACKMEM(state) ((state)->availMem < 0)
#define USEMEM(state,amt) ((state)->availMem -= (amt))
#define FREEMEM(state,amt) ((state)->availMem += (amt))
@@ -194,9 +205,9 @@ struct Tuplestorestate
*
* NOTES about on-tape representation of tuples:
*
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * In case of tuples we use first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
* may or may not match the in-memory representation of the tuple ---
* any conversion needed is the job of the writetup and readtup routines.
*
@@ -207,10 +218,13 @@ struct Tuplestorestate
* state->backward is not set, the write/read routines may omit the extra
* length word.
*
- * writetup is expected to write both length words as well as the tuple
+ * In the case of Datum with constant length, both "unsigned int" are omitted.
+ *
+ * writetup is expected to write both length words and the tuple
* data. When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted like in case of content-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
*
* The write/read routines can make use of the tuple description data
* stored in the Tuplestorestate record, if needed. They are also expected
@@ -242,11 +256,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
static void dumptuples(Tuplestorestate *state);
static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
static void *copytup_heap(Tuplestorestate *state, void *tup);
static void writetup_heap(Tuplestorestate *state, void *tup);
static void *readtup_heap(Tuplestorestate *state, unsigned int len);
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
/*
* tuplestore_begin_xxx
@@ -269,6 +288,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
state->allowedMem = maxKBytes * (int64) 1024;
state->availMem = state->allowedMem;
state->myfile = NULL;
+ /*
+ * Set Datum related data to invalid by default.
+ */
+ state->datumType = InvalidOid;
+ state->datumTypeLen = 0;
+ state->datumTypeByVal = false;
/*
* The palloc/pfree pattern for tuple memory is in a FIFO pattern. A
@@ -346,6 +371,37 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
state->copytup = copytup_heap;
state->writetup = writetup_heap;
state->readtup = readtup_heap;
+ state->lentup = lentup_heap;
+
+ return state;
+}
+
+/*
+ * The same as tuplestore_begin_heap but create store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+ Tuplestorestate *state;
+ int eflags;
+
+ /*
+ * This interpretation of the meaning of randomAccess is compatible with
+ * the pre-8.3 behavior of tuplestores.
+ */
+ eflags = randomAccess ?
+ (EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+ (EXEC_FLAG_REWIND);
+
+ state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+ state->datumType = datumType;
+ get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+ Assert(!(state->datumTypeByVal && randomAccess));
+
+ state->copytup = copytup_datum;
+ state->writetup = writetup_datum;
+ state->readtup = readtup_datum;
+ state->lentup = lentup_datum;
return state;
}
@@ -444,16 +500,19 @@ tuplestore_clear(Tuplestorestate *state)
{
int64 availMem = state->availMem;
- /*
- * Below, we reset the memory context for storing tuples. To save
- * from having to always call GetMemoryChunkSpace() on all stored
- * tuples, we adjust the availMem to forget all the tuples and just
- * recall USEMEM for the space used by the memtuples array. Here we
- * just Assert that's correct and the memory tracking hasn't gone
- * wrong anywhere.
- */
- for (i = state->memtupdeleted; i < state->memtupcount; i++)
- availMem += GetMemoryChunkSpace(state->memtuples[i]);
+ if (!state->datumTypeByVal)
+ {
+ /*
+ * Below, we reset the memory context for storing tuples. To save
+ * from having to always call GetMemoryChunkSpace() on all stored
+ * tuples, we adjust the availMem to forget all the tuples and just
+ * recall USEMEM for the space used by the memtuples array. Here we
+ * just Assert that's correct and the memory tracking hasn't gone
+ * wrong anywhere.
+ */
+ for (i = state->memtupdeleted; i < state->memtupcount; i++)
+ availMem += GetMemoryChunkSpace(state->memtuples[i]);
+ }
availMem += GetMemoryChunkSpace(state->memtuples);
@@ -777,6 +836,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
MemoryContextSwitchTo(oldcxt);
}
+/*
+ * Like tuplestore_puttupleslot but for single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+ MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+ /*
+ * Copy the Datum. (Must do this even in WRITEFILE case. Note that
+ * COPYTUP includes USEMEM, so we needn't do that here.)
+ */
+ datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+ tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+ MemoryContextSwitchTo(oldcxt);
+}
+
/*
* Similar to tuplestore_puttuple(), but work from values + nulls arrays.
* This avoids an extra tuple-construction operation.
@@ -1028,10 +1106,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
pg_fallthrough;
case TSS_READFILE:
- *should_free = true;
+ *should_free = !state->datumTypeByVal;
if (forward)
{
- if ((tuplen = getlen(state, true)) != 0)
+ if ((tuplen = LENTUP(state, true)) != 0)
{
tup = READTUP(state, tuplen);
return tup;
@@ -1043,6 +1121,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
}
}
+ Assert(!state->datumTypeByVal);
/*
* Backward.
*
@@ -1060,7 +1139,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
Assert(!state->truncated);
return NULL;
}
- tuplen = getlen(state, false);
+ tuplen = LENTUP(state, false);
if (readptr->eof_reached)
{
@@ -1091,7 +1170,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
Assert(!state->truncated);
return NULL;
}
- tuplen = getlen(state, false);
+ tuplen = LENTUP(state, false);
}
/*
@@ -1153,6 +1232,41 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
}
}
+bool
+tuplestore_getdatum(Tuplestorestate *state, bool forward,
+ bool *should_free, Datum *result)
+{
+ Datum datum;
+ *should_free = false;
+
+ datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+
+ /* For by-value datum we may receive zero as valid value. */
+ if (state->datumTypeByVal)
+ {
+ /* Return false only on EOF */
+ if (state->readptrs[state->activeptr].eof_reached)
+ {
+ *result = PointerGetDatum(NULL);
+ return false;
+ }
+
+ *result = datum;
+ return true;
+ }
+
+ if (datum)
+ {
+ *result = datum;
+ return true;
+ }
+ else
+ {
+ *result = PointerGetDatum(NULL);
+ return false;
+ }
+}
+
/*
* tuplestore_advance - exported function to adjust position without fetching
*
@@ -1173,10 +1287,20 @@ tuplestore_advance(Tuplestorestate *state, bool forward)
pfree(tuple);
return true;
}
- else
+
+ /*
+ * A NULL return normally means end-of-data, but for by-value datum
+ * stores a valid zero-valued datum (e.g., false, 0) is indistinguishable
+ * from NULL via pointer check. Use eof_reached to distinguish.
+ */
+ if (state->datumTypeByVal)
{
- return false;
+ TSReadPointer *readptr = &state->readptrs[state->activeptr];
+
+ return !readptr->eof_reached;
}
+
+ return false;
}
/*
@@ -1239,7 +1363,12 @@ tuplestore_skiptuples(Tuplestorestate *state, int64 ntuples, bool forward)
tuple = tuplestore_gettuple(state, forward, &should_free);
if (tuple == NULL)
- return false;
+ {
+ /* See tuplestore_advance for why pointer check is insufficient */
+ if (!state->datumTypeByVal ||
+ state->readptrs[state->activeptr].eof_reached)
+ return false;
+ }
if (should_free)
pfree(tuple);
CHECK_FOR_INTERRUPTS();
@@ -1461,8 +1590,11 @@ tuplestore_trim(Tuplestorestate *state)
/* Release no-longer-needed tuples */
for (i = state->memtupdeleted; i < nremove; i++)
{
- FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
- pfree(state->memtuples[i]);
+ if (!state->datumTypeByVal)
+ {
+ FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+ pfree(state->memtuples[i]);
+ }
state->memtuples[i] = NULL;
}
state->memtupdeleted = nremove;
@@ -1557,25 +1689,6 @@ tuplestore_in_memory(Tuplestorestate *state)
return (state->status == TSS_INMEM);
}
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
- unsigned int len;
- size_t nbytes;
-
- nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
- if (nbytes == 0)
- return 0;
- else
- return len;
-}
-
-
/*
* Routines specialized for HeapTuple case
*
@@ -1586,6 +1699,19 @@ getlen(Tuplestorestate *state, bool eofOK)
* to write that separately.
*/
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+ unsigned int len;
+ size_t nbytes;
+
+ nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+ if (nbytes == 0)
+ return 0;
+ else
+ return len;
+}
+
static void *
copytup_heap(Tuplestorestate *state, void *tup)
{
@@ -1632,3 +1758,122 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
return tuple;
}
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length and Variable-length includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying, storing single extra byte
+ * XXX: consider refactoring to avoid it, currently need it for correct rewind logic
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+ unsigned int len;
+ size_t nbytes;
+
+ Assert(state->datumType != InvalidOid);
+
+ if (state->datumTypeByVal)
+ {
+ uint8_t junk;
+ nbytes = BufFileReadMaybeEOF(state->myfile, &junk, sizeof(uint8_t), eofOK);
+ if (nbytes == 0)
+ return 0;
+ Assert(junk == (uint8_t) state->datumTypeLen);
+ return state->datumTypeLen;
+ }
+
+ nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+ if (nbytes == 0)
+ return 0;
+ return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+ Datum d;
+ Assert(state->datumType != InvalidOid);
+ if (state->datumTypeByVal)
+ return DatumGetPointer(PointerGetDatum(datum));
+
+ d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+ USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+ return DatumGetPointer(d);
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+ Assert(state->datumType != InvalidOid);
+ if (state->datumTypeByVal)
+ {
+ uint8_t junk = state->datumTypeLen; /* overflow is ok */
+ Datum v;
+ Assert(state->datumTypeLen > 0);
+
+ /* just marker byte used to track the end of data for rewind logic */
+ BufFileWrite(state->myfile, &junk, sizeof(junk));
+ store_att_byval(&v, PointerGetDatum(datum), state->datumTypeLen);
+ BufFileWrite(state->myfile, &v, state->datumTypeLen);
+ Assert(!state->backward);
+ }
+ else
+ {
+ unsigned int size = state->datumTypeLen;
+ unsigned int tuplen;
+
+ if (state->datumTypeLen < 0)
+ size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+
+ /*
+ * Include sizeof(unsigned int) in the stored length, matching the
+ * convention used by writetup_heap. The backward-scan seek
+ * arithmetic in tuplestore_gettuple assumes this.
+ */
+ tuplen = size + sizeof(unsigned int);
+ BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+ BufFileWrite(state->myfile, datum, size);
+
+ /* need trailing length word? */
+ if (state->backward)
+ BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+ FREEMEM(state, GetMemoryChunkSpace(datum));
+ pfree(datum);
+ }
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+ Assert(state->datumType != InvalidOid);
+ if (state->datumTypeByVal)
+ {
+ Datum datum;
+
+ Assert(state->datumTypeLen > 0);
+ Assert(len == state->datumTypeLen);
+ BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+
+ Assert(!state->backward);
+ return DatumGetPointer(fetch_att(&datum, true, state->datumTypeLen));
+ }
+ else
+ {
+ unsigned int datalen = len - sizeof(unsigned int);
+ Datum *data = palloc(datalen);
+
+ BufFileReadExact(state->myfile, data, datalen);
+
+ /* need trailing length word? */
+ if (state->backward)
+ BufFileReadExact(state->myfile, &len, sizeof(len));
+
+ return data;
+ }
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 1c08e219e89..665d6d57635 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
/*-------------------------------------------------------------------------
*
* tuplestore.h
- * Generalized routines for temporary tuple storage.
+ * Generalized routines for temporary storage of tuples and Datums.
*
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc. It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples. However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written. This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples. However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
*
* A temporary file is used to handle the data if it exceeds the
* space limit specified by the caller.
@@ -39,14 +40,13 @@
*/
typedef struct Tuplestorestate Tuplestorestate;
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
bool interXact,
int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+ bool randomAccess,
+ bool interXact,
+ int maxKBytes);
extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
extern int tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+ bool *should_free, Datum *result);
extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
--
2.43.0
[application/octet-stream] v31-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.4K, 3-v31-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch)
download | inline diff:
From 2b1a423175ca6893b56bda69e1827e595f22df5e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sun, 11 Jan 2026 19:27:52 +0300
Subject: [PATCH v31 2/7] Add STIR access method and flags related to auxiliary
indexes
This patch provides infrastructure for following enhancements to concurrent index builds by:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete called during the validation phase of concurrent index build
- STIR (Short-Term Index Replacement) access method is introduced, intended solely for short-lived, auxiliary usage
STIR functions are designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.
Planned to be used in following commits.
---
contrib/pgstattuple/pgstattuple.c | 3 +
src/backend/access/Makefile | 1 +
src/backend/access/heap/vacuumlazy.c | 2 +
src/backend/access/meson.build | 1 +
src/backend/access/stir/Makefile | 18 +
src/backend/access/stir/meson.build | 5 +
src/backend/access/stir/stir.c | 565 +++++++++++++++++++++++
src/backend/catalog/index.c | 1 +
src/backend/catalog/toasting.c | 1 +
src/backend/commands/analyze.c | 1 +
src/backend/commands/vacuumparallel.c | 1 +
src/backend/nodes/makefuncs.c | 1 +
src/include/access/genam.h | 1 +
src/include/access/reloptions.h | 3 +-
src/include/access/stir.h | 113 +++++
src/include/catalog/pg_am.dat | 3 +
src/include/catalog/pg_opclass.dat | 4 +
src/include/catalog/pg_opfamily.dat | 2 +
src/include/catalog/pg_proc.dat | 4 +
src/include/nodes/execnodes.h | 7 +-
src/include/utils/index_selfuncs.h | 8 +
src/test/regress/expected/amutils.out | 8 +-
src/test/regress/expected/opr_sanity.out | 7 +-
src/test/regress/expected/psql.out | 24 +-
24 files changed, 766 insertions(+), 18 deletions(-)
create mode 100644 src/backend/access/stir/Makefile
create mode 100644 src/backend/access/stir/meson.build
create mode 100644 src/backend/access/stir/stir.c
create mode 100644 src/include/access/stir.h
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
case SPGIST_AM_OID:
err = "spgist index";
break;
+ case STIR_AM_OID:
+ err = "stir index";
+ break;
case BRIN_AM_OID:
err = "brin index";
break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index e88d72ea039..ebbcfa90715 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
nbtree \
rmgrdesc \
spgist \
+ stir \
sequence \
table \
tablesample \
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 82c5b28e0ad..f1785b9a456 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3138,6 +3138,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = reltuples;
ivinfo.strategy = vacrel->bstrategy;
+ ivinfo.validate_index = false;
/*
* Update error traceback information.
@@ -3189,6 +3190,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
ivinfo.num_heap_tuples = reltuples;
ivinfo.strategy = vacrel->bstrategy;
+ ivinfo.validate_index = false;
/*
* Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 5fd18de74f9..7219c65f365 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
subdir('rmgrdesc')
subdir('sequence')
subdir('spgist')
+subdir('stir')
subdir('table')
subdir('tablesample')
subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..8785dab37bd
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/stir
+#
+# IDENTIFICATION
+# src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ stir.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..4b7ad15346c
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+backend_sources += files(
+ 'stir.c',
+)
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..f21b229de42
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,565 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ * Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/stir.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+ IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+ /* Set STIR-specific strategy and procedure numbers */
+ amroutine->amstrategies = STIR_NSTRATEGIES;
+ amroutine->amsupport = STIR_NPROC;
+ amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+ /* STIR doesn't support most index operations */
+ amroutine->amcanorder = false;
+ amroutine->amcanorderbyop = false;
+ amroutine->amcanbackward = false;
+ amroutine->amcanunique = false;
+ amroutine->amcanmulticol = true;
+ amroutine->amoptionalkey = true;
+ amroutine->amsearcharray = false;
+ amroutine->amsearchnulls = false;
+ amroutine->amstorage = false;
+ amroutine->amclusterable = false;
+ amroutine->ampredlocks = false;
+ amroutine->amcanparallel = false;
+ amroutine->amcanbuildparallel = false;
+ amroutine->amcaninclude = true;
+ amroutine->amusemaintenanceworkmem = false;
+ amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+ amroutine->amkeytype = InvalidOid;
+
+ /* Set up function callbacks */
+ amroutine->ambuild = stirbuild;
+ amroutine->ambuildempty = stirbuildempty;
+ amroutine->aminsert = stirinsert;
+ amroutine->aminsertcleanup = NULL;
+ amroutine->ambulkdelete = stirbulkdelete;
+ amroutine->amvacuumcleanup = stirvacuumcleanup;
+ amroutine->amcanreturn = NULL;
+ amroutine->amcostestimate = stircostestimate;
+ amroutine->amoptions = stiroptions;
+ amroutine->amproperty = NULL;
+ amroutine->ambuildphasename = NULL;
+ amroutine->amvalidate = stirvalidate;
+ amroutine->amadjustmembers = NULL;
+ amroutine->ambeginscan = stirbeginscan;
+ amroutine->amrescan = stirrescan;
+ amroutine->amgettuple = NULL;
+ amroutine->amgetbitmap = NULL;
+ amroutine->amendscan = stirendscan;
+ amroutine->ammarkpos = NULL;
+ amroutine->amrestrpos = NULL;
+ amroutine->amestimateparallelscan = NULL;
+ amroutine->aminitparallelscan = NULL;
+ amroutine->amparallelrescan = NULL;
+
+ PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validate may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+ bool result = true;
+ HeapTuple classtup;
+ Form_pg_opclass classform;
+ Oid opfamilyoid;
+ HeapTuple familytup;
+ Form_pg_opfamily familyform;
+ char *opfamilyname;
+ CatCList *oprlist;
+ int i;
+
+ /* Fetch opclass information */
+ classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+ if (!HeapTupleIsValid(classtup))
+ elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+ classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+ opfamilyoid = classform->opcfamily;
+
+ /* Fetch opfamily information */
+ familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+ if (!HeapTupleIsValid(familytup))
+ elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+ familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+ opfamilyname = NameStr(familyform->opfname);
+
+ /* Fetch all operators and support functions of the opfamily */
+ oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+
+ /* Check individual operators */
+ for (i = 0; i < oprlist->n_members; i++)
+ {
+ HeapTuple oprtup = &oprlist->members[i]->tuple;
+ Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+ /* Check it's allowed strategy for stir */
+ if (oprform->amopstrategy < 1 ||
+ oprform->amopstrategy > STIR_NSTRATEGIES)
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+ opfamilyname,
+ format_operator(oprform->amopopr),
+ oprform->amopstrategy)));
+ result = false;
+ }
+
+ /* stir doesn't support ORDER BY operators */
+ if (oprform->amoppurpose != AMOP_SEARCH ||
+ OidIsValid(oprform->amopsortfamily))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+
+ /* Check operator signature --- same for all stir strategies */
+ if (!check_amop_signature(oprform->amopopr, BOOLOID,
+ oprform->amoplefttype,
+ oprform->amoprighttype))
+ {
+ ereport(INFO,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("stir opfamily %s contains operator %s with wrong signature",
+ opfamilyname,
+ format_operator(oprform->amopopr))));
+ result = false;
+ }
+ }
+
+ ReleaseCatCacheList(oprlist);
+ ReleaseSysCache(familytup);
+ ReleaseSysCache(classtup);
+
+ return result;
+}
+
+/*
+ * Initialize meta-page of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+ StirMetaPageData *metadata;
+
+ StirInitPage(metaPage, STIR_META);
+ metadata = StirPageGetMeta(metaPage);
+ memset(metadata, 0, sizeof(StirMetaPageData));
+ metadata->magicNumber = STIR_MAGIC_NUMBER;
+ metadata->skipInserts = skipInserts;
+ ((PageHeader) metaPage)->pd_lower = ((char *) metadata + sizeof(StirMetaPageData)) - (char *) metaPage;
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+ Buffer metaBuffer;
+ Page metaPage;
+
+ Assert(!RelationNeedsWAL(index));
+ /*
+ * Make a new page; since it is the first page it should be associated with
+ * block number 0 (STIR_METAPAGE_BLKNO). No need to hold the extension
+ * lock because there cannot be concurrent inserters yet.
+ */
+ metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+ Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+ metaPage = BufferGetPage(metaBuffer);
+ StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+ MarkBufferDirty(metaBuffer);
+ END_CRIT_SECTION();
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+ StirPageOpaque opaque;
+
+ PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+ opaque = StirPageGetOpaque(page);
+ opaque->flags = flags;
+ opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if the tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+ StirTuple *itup;
+ StirPageOpaque opaque;
+ char *ptr;
+
+ /* We shouldn't be pointed to an invalid page */
+ Assert(!PageIsNew(page));
+
+ /* Does the new tuple fit on the page? */
+ if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+ return false;
+
+ /* Copy a new tuple to the end of the page */
+ opaque = StirPageGetOpaque(page);
+ itup = StirPageGetTuple(page, opaque->maxoff + 1);
+ memcpy(itup, tuple, sizeof(StirTuple));
+
+ /* Adjust maxoff and pd_lower */
+ opaque->maxoff++;
+ ptr = (char *) StirPageGetTuple(page, opaque->maxoff + 1);
+ ((PageHeader) page)->pd_lower = ptr - page;
+
+ /* Assert we didn't overrun available space */
+ Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+ return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo)
+{
+ StirTuple itup;
+ StirMetaPageData *metaData;
+ Buffer buffer,
+ metaBuffer;
+ Page page;
+ BlockNumber blkNo;
+
+ itup.heapPtr = *ht_ctid;
+
+ Assert(!RelationNeedsWAL(index));
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+ for (;;)
+ {
+ LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+ metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+ /* Check if inserts are allowed */
+ if (metaData->skipInserts)
+ {
+ UnlockReleaseBuffer(metaBuffer);
+ return false;
+ }
+ blkNo = metaData->lastBlkNo;
+ /* Don't hold metabuffer lock while doing insert */
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+ if (blkNo > 0)
+ {
+ buffer = ReadBuffer(index, blkNo);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ page = BufferGetPage(buffer);
+
+ Assert(!PageIsNew(page));
+
+ /* Try to add tuple to the existing page */
+ if (StirPageAddItem(page, &itup))
+ {
+ /* Success! Apply the change, clean up, and exit */
+ MarkBufferDirty(buffer);
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buffer);
+ ReleaseBuffer(metaBuffer);
+ return false;
+ }
+
+ END_CRIT_SECTION();
+ UnlockReleaseBuffer(buffer);
+ }
+
+ /* Need to add a new page - get exclusive lock on meta-page */
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+
+ /* Re-check after acquiring exclusive lock */
+ if (metaData->skipInserts)
+ {
+ UnlockReleaseBuffer(metaBuffer);
+ return false;
+ }
+
+ /* Check if another backend already extended the index */
+ if (blkNo != metaData->lastBlkNo)
+ {
+ Assert(blkNo < metaData->lastBlkNo);
+ /* Someone else inserted the new page into the index, let's try again */
+ LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+ continue;
+ }
+ else
+ {
+ /* Must extend the file */
+ buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+ page = BufferGetPage(buffer);
+ START_CRIT_SECTION();
+
+ StirInitPage(page, 0);
+
+ if (!StirPageAddItem(page, &itup))
+ {
+ /* We shouldn't be here since we're inserting to an empty page */
+ elog(ERROR, "could not add new stir tuple to empty page");
+ }
+
+ /* Update meta-page with new last block number */
+ metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+ MarkBufferDirty(metaBuffer);
+ MarkBufferDirty(buffer);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buffer);
+ UnlockReleaseBuffer(metaBuffer);
+
+ return false;
+ }
+ }
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta-page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo)
+{
+ IndexBuildResult *result;
+
+ if (!indexInfo->ii_Auxiliary)
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Building STIR indexes is not supported")));
+
+ StirInitMetapage(index, MAIN_FORKNUM);
+
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+ result->heap_tuples = 0;
+ result->index_tuples = 0;
+ return result;
+}
+
+void stirbuildempty(Relation index)
+{
+ StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ Relation index = info->index;
+ BlockNumber blkno, npages;
+ Buffer buffer;
+ Page page;
+
+ /* For normal VACUUM, mark to skip inserts and warn about an index drop needed */
+ if (!info->validate_index)
+ {
+ StirMarkAsSkipInserts(index);
+
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+ return NULL;
+ }
+
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ /*
+ * Iterate over the pages. We don't care about concurrently added pages,
+ * because the index is marked as not-ready for that moment and the index is not
+ * used for insert.
+ */
+ npages = RelationGetNumberOfBlocks(index);
+ for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+ {
+ StirTuple *itup, *itupEnd;
+
+ vacuum_delay_point(false);
+
+ buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+ RBM_NORMAL, info->strategy);
+
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buffer);
+
+ if (PageIsNew(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ continue;
+ }
+
+ itup = StirPageGetTuple(page, FirstOffsetNumber);
+ itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+ while (itup < itupEnd)
+ {
+ /* Do we have to delete this tuple? */
+ if (callback(&itup->heapPtr, callback_state))
+ {
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+ }
+
+ itup = StirPageGetNextTuple(itup);
+ }
+
+ UnlockReleaseBuffer(buffer);
+ }
+
+ return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+ StirMetaPageData *metaData;
+ Buffer metaBuffer;
+ Page metaPage;
+
+ Assert(!RelationNeedsWAL(index));
+ metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ metaPage = BufferGetPage(metaBuffer);
+ metaData = StirPageGetMeta(metaPage);
+
+ if (!metaData->skipInserts)
+ {
+ metaData->skipInserts = true;
+ MarkBufferDirty(metaBuffer);
+ }
+ END_CRIT_SECTION();
+ UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats)
+{
+ StirMarkAsSkipInserts(info->index);
+ ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+ return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+ return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+ double loop_count, Cost *indexStartupCost,
+ Cost *indexTotalCost, Selectivity *indexSelectivity,
+ double *indexCorrelation, double *indexPages)
+{
+ ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8b3c60d91f9..f5484c59d18 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3412,6 +3412,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
ivinfo.message_level = DEBUG2;
ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
ivinfo.strategy = NULL;
+ ivinfo.validate_index = true;
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 078a1cf5127..c33e43df1ec 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -313,6 +313,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
indexInfo->ii_ParallelWorkers = 0;
indexInfo->ii_Am = BTREE_AM_OID;
indexInfo->ii_AmCache = NULL;
+ indexInfo->ii_Auxiliary = false;
indexInfo->ii_Context = CurrentMemoryContext;
collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index eeed91be266..1fbe70d187c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -726,6 +726,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
ivinfo.message_level = elevel;
ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
ivinfo.strategy = vac_strategy;
+ ivinfo.validate_index = false;
stats = index_vacuum_cleanup(&ivinfo, NULL);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 279108ca89f..dfdccfaf991 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -885,6 +885,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
ivinfo.estimated_count = pvs->shared->estimated_count;
ivinfo.num_heap_tuples = pvs->shared->reltuples;
ivinfo.strategy = pvs->bstrategy;
+ ivinfo.validate_index = false;
/* Update error traceback information */
pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 3cd35c5c457..5359dab1176 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
/* initialize index-build state to default */
n->ii_BrokenHotChain = false;
n->ii_ParallelWorkers = 0;
+ n->ii_Auxiliary = false;
/* set up for possible use by index AM */
n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1a27bf060b3..0356901ee10 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -58,6 +58,7 @@ typedef struct IndexVacuumInfo
bool estimated_count; /* num_heap_tuples is an estimate */
int message_level; /* ereport level for progress messages */
double num_heap_tuples; /* tuples remaining in heap */
+ bool validate_index; /* validating concurrently built index? */
BufferAccessStrategy strategy; /* access strategy for reads */
} IndexVacuumInfo;
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 0bd17b30ca7..e2966165e6f 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -52,8 +52,9 @@ typedef enum relopt_kind
RELOPT_KIND_VIEW = (1 << 9),
RELOPT_KIND_BRIN = (1 << 10),
RELOPT_KIND_PARTITIONED = (1 << 11),
+ RELOPT_KIND_STIR = (1 << 12),
/* if you add a new kind, make sure you update "last_default" too */
- RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+ RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..18ee36506fd
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ * header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STIR_H
+#define STIR_H
+
+#include "access/amapi.h"
+#include "access/xlog.h"
+#include "access/generic_xlog.h"
+#include "access/itup.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC 0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES 1
+
+#define STIR_OPTIONS_PROC 0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+ ((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page) ((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+ ((StirTuple *)(PageGetContents(page) \
+ + sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+ ((StirTuple *)((char *)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO (0)
+#define STIR_HEAD_BLKNO (1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+ OffsetNumber maxoff; /* number of index tuples on the page */
+ uint16 flags; /* see bit definitions below */
+ uint16 stir_page_id; /* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META (1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID 0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+ uint32 magicNumber;
+ BlockNumber lastBlkNo;
+ bool skipInserts; /* should we just exit without any inserts? */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGIC_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page) ((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+ ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+ - StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+ - MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+ ItemPointer ht_ctid, Relation heapRel,
+ IndexUniqueCheck checkUnique,
+ bool indexUnchanged,
+ struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+ ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+ struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+ void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+ IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif /* STIR_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 46d361047fe..8bd2c2b46ba 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
{ oid => '3580', oid_symbol => 'BRIN_AM_OID',
descr => 'block range index (BRIN) access method',
amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+ descr => 'short term index replacement access method',
+ amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index df170b80840..a3457e749db 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -492,4 +492,8 @@
# no brin opclass for the geometric types except box
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+ opcfamily => 'stir/any_ops', opcintype => 'any'},
+
]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 7a027c4810e..6ffc20a061c 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -308,5 +308,7 @@
opfmethod => 'hash', opfname => 'multirange_ops' },
{ oid => '6158',
opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+ opfmethod => 'stir', opfname => 'any_ops' },
]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fc8d82665b8..bac9a148700 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
proname => 'brinhandler', provolatile => 'v',
prorettype => 'index_am_handler', proargtypes => 'internal',
prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+ proname => 'stirhandler', provolatile => 'v',
+ prorettype => 'index_am_handler', proargtypes => 'internal',
+ prosrc => 'stirhandler' },
{ oid => '3952', descr => 'brin: standalone scan new table pages',
proname => 'brin_summarize_new_values', provolatile => 'v',
proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0716c5a9aed..0f834889912 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -166,8 +166,8 @@ typedef struct ExprState
* entries for a particular index. Used for both index_build and
* retail creation of index entries.
*
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
* ----------------
*/
typedef struct IndexInfo
@@ -227,7 +227,8 @@ typedef struct IndexInfo
bool ii_WithoutOverlaps;
/* # of workers requested (excludes leader) */
int ii_ParallelWorkers;
-
+ /* is auxiliary for concurrent index build? */
+ bool ii_Auxiliary;
/* Oid of index AM */
Oid ii_Am;
/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 74793a1a19d..bf0e30dabe9 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
Selectivity *indexSelectivity,
double *indexCorrelation,
double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+ struct IndexPath *path,
+ double loop_count,
+ Cost *indexStartupCost,
+ Cost *indexTotalCost,
+ Selectivity *indexSelectivity,
+ double *indexCorrelation,
+ double *indexPages);
extern void gincostestimate(struct PlannerInfo *root,
struct IndexPath *path,
double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
spgist | can_exclude | t
spgist | can_include | t
spgist | bogus |
-(36 rows)
+ stir | can_order | f
+ stir | can_unique | f
+ stir | can_multi_col | t
+ stir | can_exclude | f
+ stir | can_include | t
+ stir | bogus |
+(42 rows)
--
-- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 6ff4d7ee901..9259679eea2 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2129,9 +2129,10 @@ FROM pg_opclass AS c1
WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
WHERE a1.amopfamily = c1.opcfamily
AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily
----------+-----------
-(0 rows)
+ opcname | opcfamily
+----------+-----------
+ stir_ops | 5558
+(1 row)
-- Check that each operator listed in pg_amop has an associated opclass,
-- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA *
List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
heap | Table
heap2 | Table
spgist | Index
-(8 rows)
+ stir | Index
+(9 rows)
\dA h*
List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
\dA: extra argument "bar" ignored
\dA+
- List of access methods
- Name | Type | Handler | Description
---------+-------+----------------------+----------------------------------------
+ List of access methods
+ Name | Type | Handler | Description
+--------+-------+----------------------+--------------------------------------------
brin | Index | brinhandler | block range index (BRIN) access method
btree | Index | bthandler | b-tree index access method
gin | Index | ginhandler | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement access method
+(9 rows)
\dA+ *
- List of access methods
- Name | Type | Handler | Description
---------+-------+----------------------+----------------------------------------
+ List of access methods
+ Name | Type | Handler | Description
+--------+-------+----------------------+--------------------------------------------
brin | Index | brinhandler | block range index (BRIN) access method
btree | Index | bthandler | b-tree index access method
gin | Index | ginhandler | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
heap | Table | heap_tableam_handler | heap table access method
heap2 | Table | heap_tableam_handler |
spgist | Index | spghandler | SP-GiST index access method
-(8 rows)
+ stir | Index | stirhandler | short term index replacement access method
+(9 rows)
\dA+ h*
List of access methods
--
2.43.0
[application/octet-stream] v31-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.9K, 4-v31-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
download | inline diff:
From a1b4fb5ced0e25ab86dfbb628b94ce0b69c23019 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v31 4/7] Use auxiliary indexes for concurrent index operations
Replace the second table full scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if exists) as in main index
- use it to track tuples inserted during the first phase
- merge auxiliary index with main index during validation to catch up new index with any tuples missed during the first phase
- automatically drop auxiliary when main index is ready
To merge main and auxiliary indexes:
- index_bulk_delete called for both, TIDs put into tuplesort
- both tuplesort are being sorted
- both tuplesort scanned with two pointers looking for the TIDs present in auxiliary index, but absent in main one
- all such TIDs are put into tuplestore
- all TIDs in tuplestore are fetched using the stream, tuplestore used in heapam_index_validate_scan_read_stream_next to provide the next page to prefetch
- if fetched tuple is alive - it is inserted into the main index
This eliminates the need for a second full table scan during validation, improving performance, especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
doc/src/sgml/monitoring.sgml | 26 +-
doc/src/sgml/ref/create_index.sgml | 34 +-
doc/src/sgml/ref/reindex.sgml | 40 +-
src/backend/access/heap/README.HOT | 13 +-
src/backend/access/heap/heapam_handler.c | 553 ++++++++++++++-------
src/backend/catalog/index.c | 308 ++++++++++--
src/backend/catalog/system_views.sql | 17 +-
src/backend/commands/indexcmds.c | 344 +++++++++++--
src/backend/nodes/makefuncs.c | 4 +-
src/include/access/tableam.h | 12 +-
src/include/catalog/index.h | 9 +-
src/include/commands/progress.h | 13 +-
src/include/nodes/makefuncs.h | 3 +-
src/test/regress/expected/create_index.out | 42 ++
src/test/regress/expected/indexing.out | 3 +-
src/test/regress/expected/rules.out | 17 +-
src/test/regress/sql/create_index.sql | 21 +
17 files changed, 1123 insertions(+), 336 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 462019a972c..b8031a3cb39 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6677,6 +6677,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
information for this phase.
</entry>
</row>
+ <row>
+ <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+ <entry>
+ <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+ with write locks that can potentially see the table to finish, to ensure use of auxiliary index for new tuples in
+ future transactions.
+ This phase is skipped when not in concurrent mode.
+ Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+ and <structname>current_locker_pid</structname> contain the progress
+ information for this phase.
+ </entry>
+ </row>
<row>
<entry><literal>building index</literal></entry>
<entry>
@@ -6717,13 +6729,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
</entry>
</row>
<row>
- <entry><literal>index validation: scanning table</literal></entry>
+ <entry><literal>index validation: merging indexes</literal></entry>
<entry>
- <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
- to validate the index tuples collected in the previous two phases.
+ <command>CREATE INDEX CONCURRENTLY</command> is merging content of auxiliary index with the target index.
This phase is skipped when not in concurrent mode.
- Columns <structname>blocks_total</structname> (set to the total size of the table)
- and <structname>blocks_done</structname> contain the progress information for this phase.
+ Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+ and <structname>tuples_done</structname> contain the progress information for this phase.
</entry>
</row>
<row>
@@ -6740,8 +6751,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry><literal>waiting for readers before marking dead</literal></entry>
<entry>
- <command>REINDEX CONCURRENTLY</command> is waiting for transactions
- with read locks on the table to finish, before marking the old index dead.
+ <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+ with read locks on the table to finish, before marking the auxiliary index as dead.
+ <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
This phase is skipped when not in concurrent mode.
Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..12c88587a79 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
out writes. This method is invoked by specifying the
<literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
When this option is used,
- <productname>PostgreSQL</productname> must perform two scans of the table, and in
- addition it must wait for all existing transactions that could potentially
- modify or use the index to terminate. Thus
- this method requires more total work than a standard index build and takes
+ <productname>PostgreSQL</productname> must perform table scan followed by
+ validation phase, and in addition it must wait for all existing transactions
+ that could potentially modify or use the index to terminate. Thus
+ this method requires more total work than a standard index build and may take
significantly longer to complete. However, since it allows normal
operations to continue while the index is built, this method is useful for
adding new indexes in a production environment. Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
</para>
<para>
- In a concurrent index build, the index is actually entered as an
- <quote>invalid</quote> index into
- the system catalogs in one transaction, then two table scans occur in
- two more transactions. Before each table scan, the index build must
+ In a concurrent index build, the main and auxiliary indexes are actually
+ entered as an <quote>invalid</quote> index into
+ the system catalogs in one transaction, then two phases occur in
+ multiple transactions. Before each phase, the index build must
wait for existing transactions that have modified the table to terminate.
- After the second scan, the index build must wait for any transactions
+ After the second phase, the index build must wait for any transactions
that have a snapshot (see <xref linkend="mvcc"/>) predating the second
- scan to terminate, including transactions used by any phase of concurrent
+ phase to terminate, including transactions used by any phase of concurrent
index builds on other tables, if the indexes involved are partial or have
columns that are not simple column references.
Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
<para>
If a problem arises while scanning the table, such as a deadlock or a
uniqueness violation in a unique index, the <command>CREATE INDEX</command>
- command will fail but leave behind an <quote>invalid</quote> index. This index
- will be ignored for querying purposes because it might be incomplete;
- however it will still consume update overhead. The <application>psql</application>
- <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+ command will fail but leave behind an <quote>invalid</quote> index and its
+ associated auxiliary index. These indexes
+ will be ignored for querying purposes because they might be incomplete;
+ however they will still consume update overhead. The <application>psql</application>
+ <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
<programlisting>
postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
col | integer | | |
Indexes:
"idx" btree (col) INVALID
+ "idx_ccaux" stir (col) INVALID
</programlisting>
The recommended recovery
- method in such cases is to drop the index and try again to perform
- <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
+ method in such cases is to drop these indexes and try again to perform
+ <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
</para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..9e0248261ae 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
<productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
of writes. This method is invoked by specifying the
<literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
- is used, <productname>PostgreSQL</productname> must perform two scans of the table
- for each index that needs to be rebuilt and wait for termination of
- all existing transactions that could potentially use the index.
+ is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+ consistency while allowing normal operations to continue.
This method requires more total work than a standard index
rebuild and takes significantly longer to complete as it needs to wait
for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
<orderedlist>
<listitem>
<para>
- A new transient index definition is added to the catalog
+ A new transient index definition and an auxiliary index are added to the catalog
<literal>pg_index</literal>. This definition will be used to replace
the old index. A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
<listitem>
<para>
- A first pass to build the index is done for each new index. Once the
+ The auxiliary index is marked as "ready for inserts", making
+ it visible to other sessions. This index efficiently tracks all new
+ tuples during the reindex process.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The new main index is built by scanning the table. Once the
index is built, its flag <literal>pg_index.indisready</literal> is
switched to <quote>true</quote> to make it ready for inserts, making it
visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
<listitem>
<para>
- Then a second pass is performed to add tuples that were added while the
- first pass was running. This step is also done in a separate
- transaction for each index.
+ A validation phase merges any missing entries from the auxiliary index
+ into the main index, ensuring all concurrent changes are captured.
+ This step is also done in a separate transaction for each index.
</para>
</listitem>
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
<listitem>
<para>
- The old indexes have <literal>pg_index.indisready</literal> switched to
+ The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
<quote>false</quote> to prevent any new tuple insertions, after waiting
for running queries that might reference the old index to complete.
</para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
<listitem>
<para>
- The old indexes are dropped. The <literal>SHARE UPDATE
+ The old and auxiliary indexes are dropped. The <literal>SHARE UPDATE
EXCLUSIVE</literal> session locks for the indexes and the table are
released.
</para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
<para>
If a problem arises while rebuilding the indexes, such as a
uniqueness violation in a unique index, the <command>REINDEX</command>
- command will fail but leave behind an <quote>invalid</quote> new index in addition to
- the pre-existing one. This index will be ignored for querying purposes
- because it might be incomplete; however it will still consume update
+ command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+ the pre-existing one. These indexes will be ignored for querying purposes
+ because they might be incomplete; however they will still consume update
overhead. The <application>psql</application> <command>\d</command> command will report
- such an index as <literal>INVALID</literal>:
+ such indexes as <literal>INVALID</literal>:
<programlisting>
postgres=# \d tab
@@ -462,12 +469,13 @@ postgres=# \d tab
Indexes:
"idx" btree (col)
"idx_ccnew" btree (col) INVALID
+ "idx_ccaux" stir (col) INVALID
</programlisting>
If the index marked <literal>INVALID</literal> is suffixed
- <literal>_ccnew</literal>, then it corresponds to the transient
+ <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
index created during the concurrent operation, and the recommended
- recovery method is to drop it using <literal>DROP INDEX</literal>,
+ recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
then attempt <command>REINDEX CONCURRENTLY</command> again.
If the invalid index is instead suffixed <literal>_ccold</literal>,
it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..b1c797517ee 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT. Other transactions must include
such an index when determining HOT-safety of updates, even though they
must ignore it for both insertion and searching purposes.
+Also, special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into table while our target index is still "not ready
+for inserts".
+
We must do this to avoid making incorrect index entries. For example,
suppose we are building an index on column X and we make an index entry for
a non-HOT tuple with X=1. Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
live tuple.
We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open. Then we take
-a second reference snapshot and validate the index. This searches for
-tuples missing from the index, and inserts any missing ones. Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open. Then validate
+the index. This searches for tuples missing from the index in auxiliary
+index, and inserts any missing ones if they are visible to reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
the value to be inserted is the one from the live tuple.
Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 253a735b6c1..f90310a1ab8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,15 +41,17 @@
#include "storage/bufpage.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
#include "utils/tuplesort.h"
+#include "utils/tuplestore.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
- Relation OldHeap, Relation NewHeap,
- Datum *values, bool *isnull, RewriteState rwstate);
+ Relation OldHeap, Relation NewHeap,
+ Datum *values, bool *isnull, RewriteState rwstate);
static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
HeapTuple tuple,
@@ -1768,242 +1770,409 @@ heapam_index_build_range_scan(Relation heapRelation,
return reltuples;
}
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux / main.
+ *
+ * returns number of items added to store
+ */
+static int64
+heapam_index_validate_tuplesort_difference(Tuplesortstate *main,
+ Tuplesortstate *aux,
+ Tuplestorestate *store)
+{
+ int64 num = 0;
+ /* state variables for the merge */
+ ItemPointer indexcursor = NULL,
+ auxindexcursor = NULL;
+ ItemPointerData decoded,
+ auxdecoded;
+ bool tuplesort_empty = false,
+ auxtuplesort_empty = false;
+
+ /* Initialize pointers. */
+ ItemPointerSetInvalid(&decoded);
+ ItemPointerSetInvalid(&auxdecoded);
+
+ /*
+ * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+ * which holds TIDs that must compared to those from the "main" sort
+ * (state->tuplesort).
+ */
+ while (!auxtuplesort_empty)
+ {
+ Datum ts_val;
+ bool ts_isnull;
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Attempt to fetch the next TID from the auxiliary sort. If it's
+ * empty, we set auxindexcursor to NULL.
+ */
+ auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+ false, &ts_val, &ts_isnull,
+ NULL);
+ Assert(auxtuplesort_empty || !ts_isnull);
+ if (!auxtuplesort_empty)
+ {
+ itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+ auxindexcursor = &auxdecoded;
+ }
+ else
+ {
+ auxindexcursor = NULL;
+ }
+
+ /*
+ * If the auxiliary sort is not yet empty, we now try to synchronize
+ * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+ * the main sort cursor until we've reached or passed the auxiliary TID.
+ */
+ if (!auxtuplesort_empty)
+ {
+ /*
+ * Move the main sort forward while:
+ * (1) It's not exhausted (tuplesort_empty == false), and
+ * (2) Either indexcursor is NULL (first iteration) or
+ * indexcursor < auxindexcursor in TID order.
+ */
+ while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+ ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+ {
+ /*
+ * Get the next TID from the main sort. If it's empty,
+ * we set indexcursor to NULL.
+ */
+ tuplesort_empty = !tuplesort_getdatum(main, true,
+ false, &ts_val, &ts_isnull,
+ NULL);
+ Assert(tuplesort_empty || !ts_isnull);
+
+ if (!tuplesort_empty)
+ {
+ itemptr_decode(&decoded, DatumGetInt64(ts_val));
+ indexcursor = &decoded;
+ }
+ else
+ {
+ indexcursor = NULL;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * Now, if either:
+ * - the main sort is empty, or
+ * - indexcursor > auxindexcursor,
+ *
+ * then auxindexcursor identifies a TID that doesn't appear in
+ * the main sort. We likely need to insert it
+ * into the target index if it’s visible in the heap.
+ */
+ if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+ {
+ tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+ num++;
+ }
+ }
+ }
+
+ return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+ Tuplestorestate *store;
+ BlockNumber prev_block_number;
+ OffsetNumber prev_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ * ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ * for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ * offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+ ReadStream *stream,
+ void *void_callback_private_data,
+ void *void_per_buffer_data
+ )
+{
+ bool should_free;
+ Datum datum;
+ BlockNumber result = InvalidBlockNumber;
+ int i = 0;
+
+ /*
+ * Retrieve the specialized callback state and the output buffer.
+ * callback_private_data keeps track of the previous block and offset
+ * from a prior invocation, if any.
+ */
+ ValidateIndexScanState *callback_private_data = void_callback_private_data;
+ OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+ /*
+ * If there is a "leftover" offset number from the previous invocation,
+ * it means we had switched to a new block in the middle of the last call.
+ * We place that leftover offset number into the buffer first.
+ */
+ if (callback_private_data->prev_offset_number != InvalidOffsetNumber)
+ {
+ Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+ /*
+ * 'result' is the block number to return. We set it to the block
+ * from the previous leftover offset.
+ */
+ result = callback_private_data->prev_block_number;
+ /* Place leftover offset number in the output buffer. */
+ per_buffer_data[i++] = callback_private_data->prev_offset_number;
+ /*
+ * Clear the leftover offset number so it won't be reused unless
+ * we encounter another block change.
+ */
+ callback_private_data->prev_offset_number = InvalidOffsetNumber;
+ }
+
+ /*
+ * Read from the tuplestore until we either run out of tuples or we
+ * encounter a block change. For each tuple:
+ *
+ * 1) Decode its block/offset from the Datum.
+ * 2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+ * initialize prev_block_number.
+ * 3) If the block number matches the current block, collect the offset.
+ * 4) If the block number differs, save that offset as leftover and break
+ * so that the next call can handle the new block.
+ */
+ while (tuplestore_getdatum(callback_private_data->store, true, &should_free, &datum))
+ {
+ BlockNumber next_block_number;
+ ItemPointerData next_data;
+
+ /* Decode the datum into an ItemPointer (block + offset). */
+ itemptr_decode(&next_data, DatumGetInt64(datum));
+ next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+ /*
+ * If we haven't set a block number yet this round, initialize it
+ * using the first tuple we read.
+ */
+ if (callback_private_data->prev_block_number == InvalidBlockNumber)
+ callback_private_data->prev_block_number = next_block_number;
+
+ /*
+ * Always set the result to be the "current" block number
+ * we are filling offsets for.
+ */
+ result = callback_private_data->prev_block_number;
+
+ /*
+ * If this tuple is from the same block, just store its offset
+ * in our per_buffer_data array.
+ */
+ if (next_block_number == callback_private_data->prev_block_number)
+ {
+ per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+ /* Free the datum if needed. */
+ if (should_free)
+ pfree(DatumGetPointer(datum));
+ }
+ else
+ {
+ /*
+ * If the block just changed, store the offset of the new block
+ * as leftover for the next invocation and break out.
+ */
+ callback_private_data->prev_block_number = next_block_number;
+ callback_private_data->prev_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+ /* Free the datum if needed. */
+ if (should_free)
+ pfree(DatumGetPointer(datum));
+
+ /* Break to let the next call handle the new block. */
+ break;
+ }
+ }
+
+ /*
+ * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+ */
+ per_buffer_data[i] = InvalidOffsetNumber;
+ return result;
+}
+
static void
heapam_index_validate_scan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
Snapshot snapshot,
- ValidateIndexState *state)
+ ValidateIndexState *state,
+ ValidateIndexState *auxState)
{
- TableScanDesc scan;
- HeapScanDesc hscan;
- HeapTuple heapTuple;
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
- ExprState *predicate;
- TupleTableSlot *slot;
- EState *estate;
- ExprContext *econtext;
- BlockNumber root_blkno = InvalidBlockNumber;
- OffsetNumber root_offsets[MaxHeapTuplesPerPage];
- bool in_index[MaxHeapTuplesPerPage];
- BlockNumber previous_blkno = InvalidBlockNumber;
-
- /* state variables for the merge */
- ItemPointer indexcursor = NULL;
- ItemPointerData decoded;
- bool tuplesort_empty = false;
+
+ TupleTableSlot *slot;
+ EState *estate;
+ ExprContext *econtext;
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+ int64 num_to_check;
+ Tuplestorestate *tuples_for_check;
+ ValidateIndexScanState callback_private_data;
+
+ Buffer buf;
+ OffsetNumber* tuples;
+ ReadStream *read_stream;
+
+ /* Use 10% of memory for tuple store. */
+ int store_work_mem_part = maintenance_work_mem / 10;
+
+ /*
+ * Encode TIDs as int8 values for the sort, rather than directly sorting
+ * item pointers. This can be significantly faster, primarily because TID
+ * is a pass-by-reference type on all platforms, whereas int8 is
+ * pass-by-value on most platforms.
+ */
+ tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
/*
* sanity checks
*/
Assert(OidIsValid(indexRelation->rd_rel->relam));
- /*
- * Need an EState for evaluation of index expressions and partial-index
- * predicates. Also a slot to hold the current tuple.
- */
+ num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+ auxState->tuplesort,
+ tuples_for_check);
+
+ /* It is our responsibility to close tuple sort as fast as we can */
+ tuplesort_end(state->tuplesort);
+ tuplesort_end(auxState->tuplesort);
+
+ state->tuplesort = auxState->tuplesort = NULL;
+
estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
- &TTSOpsHeapTuple);
+ &TTSOpsBufferHeapTuple);
/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;
- /* Set up execution state for predicate, if any. */
- predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+ callback_private_data.prev_block_number = InvalidBlockNumber;
+ callback_private_data.store = tuples_for_check;
+ callback_private_data.prev_offset_number = InvalidOffsetNumber;
- /*
- * Prepare for scan of the base relation. We need just those tuples
- * satisfying the passed-in reference snapshot. We must disable syncscan
- * here, because it's critical that we read from block zero forward to
- * match the sorted TIDs.
- */
- scan = table_beginscan_strat(heapRelation, /* relation */
- snapshot, /* snapshot */
- 0, /* number of keys */
- NULL, /* scan key */
- true, /* buffer access strategy OK */
- false); /* syncscan not OK */
- hscan = (HeapScanDesc) scan;
+ read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+ bstrategy,
+ heapRelation, MAIN_FORKNUM,
+ heapam_index_validate_scan_read_stream_next,
+ &callback_private_data,
+ (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
- hscan->rs_nblocks);
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
- /*
- * Scan all tuples matching the snapshot.
- */
- while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
{
- ItemPointer heapcursor = &heapTuple->t_self;
- ItemPointerData rootTuple;
- OffsetNumber root_offnum;
+ HeapTupleData heap_tuple_data[MaxHeapTuplesPerPage];
+ int i;
+ OffsetNumber off;
+ BlockNumber block_number;
CHECK_FOR_INTERRUPTS();
- state->htups += 1;
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ block_number = BufferGetBlockNumber(buf);
- if ((previous_blkno == InvalidBlockNumber) ||
- (hscan->rs_cblock != previous_blkno))
+ i = 0;
+ while ((off = tuples[i]) != InvalidOffsetNumber)
{
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- hscan->rs_cblock);
- previous_blkno = hscan->rs_cblock;
+ ItemPointerData tid;
+ bool all_dead, found;
+ ItemPointerSet(&tid, block_number, off);
+
+ found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+ &heap_tuple_data[i], &all_dead, true);
+ if (!found)
+ ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+ i++;
+ state->htups += 1;
}
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- /*
- * As commented in table_index_build_scan, we should index heap-only
- * tuples under the TIDs of their root tuples; so when we advance onto
- * a new heap page, build a map of root item offsets on the page.
- *
- * This complicates merging against the tuplesort output: we will
- * visit the live tuples in order by their offsets, but the root
- * offsets that we need to compare against the index contents might be
- * ordered differently. So we might have to "look back" within the
- * tuplesort output, but only within the current page. We handle that
- * by keeping a bool array in_index[] showing all the
- * already-passed-over tuplesort output TIDs of the current page. We
- * clear that array here, when advancing onto a new heap page.
- */
- if (hscan->rs_cblock != root_blkno)
+ i = 0;
+ while ((off = tuples[i]) != InvalidOffsetNumber)
{
- Page page = BufferGetPage(hscan->rs_cbuf);
-
- LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
- heap_get_root_tuples(page, root_offsets);
- LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
- memset(in_index, 0, sizeof(in_index));
-
- root_blkno = hscan->rs_cblock;
- }
-
- /* Convert actual tuple TID to root TID */
- rootTuple = *heapcursor;
- root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
- if (HeapTupleIsHeapOnly(heapTuple))
- {
- root_offnum = root_offsets[root_offnum - 1];
- if (!OffsetNumberIsValid(root_offnum))
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
- ItemPointerGetBlockNumber(heapcursor),
- ItemPointerGetOffsetNumber(heapcursor),
- RelationGetRelationName(heapRelation))));
- ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
- }
-
- /*
- * "merge" by skipping through the index tuples until we find or pass
- * the current root tuple.
- */
- while (!tuplesort_empty &&
- (!indexcursor ||
- ItemPointerCompare(indexcursor, &rootTuple) < 0))
- {
- Datum ts_val;
- bool ts_isnull;
-
- if (indexcursor)
+ if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
{
+ ItemPointerData root_tid;
+ ItemPointerSet(&root_tid, block_number, off);
+
+ /* Reset the per-tuple memory context for the next fetch. */
+ MemoryContextReset(econtext->ecxt_per_tuple_memory);
+ ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+ /* Compute the key values and null flags for this tuple. */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
/*
- * Remember index items seen earlier on the current heap page
+ * Insert the tuple into the target index.
*/
- if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
- in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+ index_insert(indexRelation,
+ values,
+ isnull,
+ &root_tid, /* insert root tuple */
+ heapRelation,
+ indexInfo->ii_Unique ?
+ UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+ false,
+ indexInfo);
+
+ state->tups_inserted += 1;
}
- tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
- false, &ts_val, &ts_isnull,
- NULL);
- Assert(tuplesort_empty || !ts_isnull);
- if (!tuplesort_empty)
- {
- itemptr_decode(&decoded, DatumGetInt64(ts_val));
- indexcursor = &decoded;
- }
- else
- {
- /* Be tidy */
- indexcursor = NULL;
- }
+ pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+ i++;
}
- /*
- * If the tuplesort has overshot *and* we didn't see a match earlier,
- * then this tuple is missing from the index, so insert it.
- */
- if ((tuplesort_empty ||
- ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
- !in_index[root_offnum - 1])
- {
- MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
- /* Set up for predicate or expression evaluation */
- ExecStoreHeapTuple(heapTuple, slot, false);
-
- /*
- * In a partial index, discard tuples that don't satisfy the
- * predicate.
- */
- if (predicate != NULL)
- {
- if (!ExecQual(predicate, econtext))
- continue;
- }
-
- /*
- * For the current heap tuple, extract all the attributes we use
- * in this index, and note which are null. This also performs
- * evaluation of any expressions needed.
- */
- FormIndexDatum(indexInfo,
- slot,
- estate,
- values,
- isnull);
-
- /*
- * You'd think we should go ahead and build the index tuple here,
- * but some index AMs want to do further processing on the data
- * first. So pass the values[] and isnull[] arrays, instead.
- */
-
- /*
- * If the tuple is already committed dead, you might think we
- * could suppress uniqueness checking, but this is no longer true
- * in the presence of HOT, because the insert is actually a proxy
- * for a uniqueness check on the whole HOT-chain. That is, the
- * tuple we have here could be dead because it was already
- * HOT-updated, and if so the updating transaction will not have
- * thought it should insert index entries. The index AM will
- * check the whole HOT-chain and correctly detect a conflict if
- * there is one.
- */
-
- index_insert(indexRelation,
- values,
- isnull,
- &rootTuple,
- heapRelation,
- indexInfo->ii_Unique ?
- UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
- false,
- indexInfo);
-
- state->tups_inserted += 1;
- }
+ ReleaseBuffer(buf);
}
- table_endscan(scan);
-
ExecDropSingleTupleTableSlot(slot);
FreeExecutorState(estate);
+ read_stream_end(read_stream);
+ tuplestore_end(tuples_for_check);
+
+ FreeAccessStrategy(bstrategy);
+
/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f5484c59d18..31f92b97580 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
* already exists.
* INDEX_CREATE_PARTITIONED:
* create a partitioned index (table must be partitioned)
+ * INDEX_CREATE_AUXILIARY:
+ * mark index as auxiliary index
* constr_flags: flags passed to index_constraint_create
* (only if INDEX_CREATE_ADD_CONSTRAINT is set)
* allow_system_table_mods: allow table to be a system catalog
* is_internal: if true, post creation hook for new index
* constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for index. In most of the
+ * cases it should be equal to the persistence level of the table,
+ * auxiliary indexes are only exception here.
*
* Returns the OID of the created index.
*/
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
bool invalid = (flags & INDEX_CREATE_INVALID) != 0;
bool concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
bool partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+ bool auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
char relkind;
TransactionId relfrozenxid;
MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
namespaceId = RelationGetNamespace(heapRelation);
shared_relation = heapRelation->rd_rel->relisshared;
mapped_relation = RelationIsMapped(heapRelation);
- relpersistence = heapRelation->rd_rel->relpersistence;
+ if (auxiliary)
+ relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+ else
+ relpersistence = heapRelation->rd_rel->relpersistence;
/*
* check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
if (indexInfo->ii_NumIndexAttrs < 1)
elog(ERROR, "must index at least one column");
+ if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("user-defined indexes with STIR access method are not supported")));
+
if (!allow_system_table_mods &&
IsSystemRelation(heapRelation) &&
IsNormalProcessingMode())
@@ -1398,7 +1412,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
false, /* not ready for inserts */
true,
indexRelation->rd_indam->amsummarizing,
- oldInfo->ii_WithoutOverlaps);
+ oldInfo->ii_WithoutOverlaps,
+ false);
/*
* Extract the list of column names and the column numbers for the new
@@ -1473,6 +1488,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
return newIndexId;
}
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller. The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+ Oid tablespaceOid, const char *newName)
+{
+ Relation indexRelation;
+ IndexInfo *oldInfo,
+ *newInfo;
+ Oid newIndexId = InvalidOid;
+ HeapTuple indexTuple;
+
+ List *indexColNames = NIL;
+ List *indexExprs = NIL;
+ List *indexPreds = NIL;
+
+ Oid *auxOpclassIds;
+ int16 *auxColoptions;
+
+ indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+ /* The new index needs some information from the old index */
+ oldInfo = BuildIndexInfo(indexRelation);
+
+ /*
+ * Build of an auxiliary index with exclusion constraints is not
+ * supported.
+ */
+ if (oldInfo->ii_ExclusionOps != NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+ /* Get the array of class and column options IDs from index info */
+ indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+ if (!HeapTupleIsValid(indexTuple))
+ elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+ /*
+ * Fetch the list of expressions and predicates directly from the
+ * catalogs. This cannot rely on the information from IndexInfo of the
+ * old index as these have been flattened for the planner.
+ */
+ if (oldInfo->ii_Expressions != NIL)
+ {
+ Datum exprDatum;
+ char *exprString;
+
+ exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+ Anum_pg_index_indexprs);
+ exprString = TextDatumGetCString(exprDatum);
+ indexExprs = (List *) stringToNode(exprString);
+ pfree(exprString);
+ }
+ if (oldInfo->ii_Predicate != NIL)
+ {
+ Datum predDatum;
+ char *predString;
+
+ predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+ Anum_pg_index_indpred);
+ predString = TextDatumGetCString(predDatum);
+ indexPreds = (List *) stringToNode(predString);
+
+ /* Also convert to implicit-AND format */
+ indexPreds = make_ands_implicit((Expr *) indexPreds);
+ pfree(predString);
+ }
+
+ /*
+ * Build the index information for the new index. Note that rebuild of
+ * indexes with exclusion constraints is not supported, hence there is no
+ * need to fill all the ii_Exclusion* fields.
+ */
+ newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+ oldInfo->ii_NumIndexKeyAttrs,
+ STIR_AM_OID, /* special AM for aux indexes */
+ indexExprs,
+ indexPreds,
+ false, /* aux index are not unique */
+ oldInfo->ii_NullsNotDistinct,
+ false, /* not ready for inserts */
+ true,
+ false, /* aux are not summarizing */
+ false, /* aux are not without overlaps */
+ true /* auxiliary */);
+
+ /*
+ * Extract the list of column names and the column numbers for the new
+ * index information. All this information will be used for the index
+ * creation.
+ */
+ for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+ {
+ TupleDesc indexTupDesc = RelationGetDescr(indexRelation);
+ Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+ indexColNames = lappend(indexColNames, NameStr(att->attname));
+ newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+ }
+
+ auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+ auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+ /* Fill with "any ops" */
+ for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+ {
+ auxOpclassIds[i] = ANY_STIR_OPS_OID;
+ auxColoptions[i] = 0;
+ }
+
+ newIndexId = index_create(heapRelation,
+ newName,
+ InvalidOid, /* indexRelationId */
+ InvalidOid, /* parentIndexRelid */
+ InvalidOid, /* parentConstraintId */
+ InvalidRelFileNumber, /* relFileNumber */
+ newInfo,
+ indexColNames,
+ STIR_AM_OID,
+ tablespaceOid,
+ indexRelation->rd_indcollation,
+ auxOpclassIds,
+ NULL,
+ auxColoptions,
+ NULL,
+ (Datum) 0,
+ INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+ 0,
+ true, /* allow table to be a system catalog? */
+ false, /* is_internal? */
+ NULL);
+
+ /* Close the relations used and clean up */
+ index_close(indexRelation, NoLock);
+ ReleaseSysCache(indexTuple);
+
+ return newIndexId;
+}
+
/*
* index_concurrently_build
*
@@ -2453,7 +2616,8 @@ BuildIndexInfo(Relation index)
indexStruct->indisready,
false,
index->rd_indam->amsummarizing,
- indexStruct->indisexclusion && indexStruct->indisunique);
+ indexStruct->indisexclusion && indexStruct->indisunique,
+ index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
/* fill in attribute numbers */
for (i = 0; i < numAtts; i++)
@@ -2513,7 +2677,8 @@ BuildDummyIndexInfo(Relation index)
indexStruct->indisready,
false,
index->rd_indam->amsummarizing,
- indexStruct->indisexclusion && indexStruct->indisunique);
+ indexStruct->indisexclusion && indexStruct->indisunique,
+ index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
/* fill in attribute numbers */
for (i = 0; i < numAtts; i++)
@@ -3289,12 +3454,21 @@ IndexCheckExclusion(Relation heapRelation,
*
* We do a concurrent index build by first inserting the catalog entry for the
* index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create special auxiliary index the same way. It based on STIR AM.
* Then we commit our transaction and start a new one, then we wait for all
* transactions that could have been modifying the table to terminate. Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
* honor its constraints on HOT updates; so while existing HOT-chains might
* be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it. We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that, we build the auxiliary index. It is fast operation without any actual
+ * table scan. As result, we have empty STIR index. We commit transaction and
+ * again wait for all transactions that could have been modifying the table
+ * to terminate. At that moment all new tuples are going to be inserted into
+ * auxiliary index.
+ *
+ * We now build the index normally via
* index_build(), while holding a weak lock that allows concurrent
* insert/update/delete. Also, we index only tuples that are valid
* as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3304,14 +3478,17 @@ IndexCheckExclusion(Relation heapRelation,
* bogus unique-index failures due to concurrent UPDATEs (we might see
* different versions of the same row as being valid when we pass over them,
* if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples contained in auxiliary index).
*
* Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third. Again we wait for all
+ * commit the third transaction and start a fourth. Again we wait for all
* transactions that could have been modifying the table to terminate. Now
* we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it. We then take a new reference snapshot
- * which is passed to validate_index(). Any tuples that are valid according
+ * insert their new tuples into it. At the same moment we clear "indisready" for
+ * auxiliary index, since it is no more required to be updated.
+ *
+ * We then take a new reference snapshot, any tuples that are valid according
* to this snap, but are not in the index, must be added to the index.
* (Any tuples committed live after the snap will be inserted into the
* index by their originating transaction. Any tuples committed dead before
@@ -3319,12 +3496,14 @@ IndexCheckExclusion(Relation heapRelation,
* that might care about them before we mark the index valid.)
*
* validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
* ever say "delete it". (This should be faster than a plain indexscan;
* also, not all index AMs support full-index indexscan.) Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index. Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both auxiliary and target indexes, and doing a "merge join" against
+ * the TID lists to see which tuples from auxiliary index are missing from the
+ * target index. Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * particular order: auxiliary first, target last.
*
* Building a unique index this way is tricky: we might try to insert a
* tuple that is already dead or is in process of being deleted, and we
@@ -3342,22 +3521,26 @@ IndexCheckExclusion(Relation heapRelation,
* not index). Then we mark the index "indisvalid" and commit. Subsequent
* transactions will be able to use it for queries.
*
- * Doing two full table scans is a brute-force strategy. We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm). However that would
- * add yet more locking issues.
+ * Also, some actions to concurrent drop the auxiliary index are performed.
*/
void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
{
Relation heapRelation,
- indexRelation;
+ indexRelation,
+ auxIndexRelation;
IndexInfo *indexInfo;
- IndexVacuumInfo ivinfo;
- ValidateIndexState state;
+ IndexVacuumInfo ivinfo, auxivinfo;
+ ValidateIndexState state, auxState;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ /* Use 80% of maintenance_work_mem to target index sorting and
+ * 10% rest for auxiliary.
+ *
+ * Rest 10% will be used for tuplestore later. */
+ int main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
+ int aux_work_mem_part = maintenance_work_mem / 10;
{
const int progress_index[] = {
@@ -3390,6 +3573,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
RestrictSearchPath();
indexRelation = index_open(indexId, RowExclusiveLock);
+ auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
/*
* Fetch info needed for index_insert. (You might think this should be
@@ -3414,15 +3598,49 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
ivinfo.strategy = NULL;
ivinfo.validate_index = true;
+ /*
+ * Copy all info to auxiliary info, changing only relation.
+ */
+ auxivinfo = ivinfo;
+ auxivinfo.index = auxIndexRelation;
+
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
* item pointers. This can be significantly faster, primarily because TID
* is a pass-by-reference type on all platforms, whereas int8 is
* pass-by-value on most platforms.
*/
+ auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+ InvalidOid, false,
+ aux_work_mem_part,
+ NULL, TUPLESORT_NONE);
+ auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+ (void) index_bulk_delete(&auxivinfo, NULL,
+ validate_index_callback, &auxState);
+ /* If aux index is empty, merge may be skipped */
+ if (auxState.itups == 0)
+ {
+ tuplesort_end(auxState.tuplesort);
+ auxState.tuplesort = NULL;
+
+ /* Roll back any GUC changes executed by index functions */
+ AtEOXact_GUC(false, save_nestlevel);
+
+ /* Restore userid and security context */
+ SetUserIdAndSecContext(save_userid, save_sec_context);
+
+ /* Close rels, but keep locks */
+ index_close(auxIndexRelation, NoLock);
+ index_close(indexRelation, NoLock);
+ table_close(heapRelation, NoLock);
+
+ return;
+ }
+
state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
InvalidOid, false,
- maintenance_work_mem,
+ (int) main_work_mem_part,
NULL, TUPLESORT_NONE);
state.htups = state.itups = state.tups_inserted = 0;
@@ -3445,27 +3663,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
pgstat_progress_update_multi_param(3, progress_index, progress_vals);
}
tuplesort_performsort(state.tuplesort);
+ tuplesort_performsort(auxState.tuplesort);
/*
- * Now scan the heap and "merge" it with the index
+ * Now merge both indexes
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+ PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
table_index_validate_scan(heapRelation,
indexRelation,
indexInfo,
snapshot,
- &state);
+ &state,
+ &auxState);
- /* Done with tuplesort object */
- tuplesort_end(state.tuplesort);
+ /* Tuple sort closed by table_index_validate_scan */
+ Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
/* Make sure to release resources cached in indexInfo (if needed). */
index_insert_cleanup(indexRelation, indexInfo);
elog(DEBUG2,
- "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
- state.htups, state.itups, state.tups_inserted);
+ "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+ " %.0f aux index tuples; inserted %.0f missing tuples",
+ state.htups, state.itups, auxState.itups, state.tups_inserted);
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -3474,6 +3695,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
SetUserIdAndSecContext(save_userid, save_sec_context);
/* Close rels, but keep locks */
+ index_close(auxIndexRelation, NoLock);
index_close(indexRelation, NoLock);
table_close(heapRelation, NoLock);
}
@@ -3534,6 +3756,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
Assert(!indexForm->indisvalid);
indexForm->indisvalid = true;
break;
+ case INDEX_DROP_CLEAR_READY:
+ /* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+ Assert(indexForm->indisready);
+ Assert(!indexForm->indisvalid);
+ indexForm->indisready = false;
+ break;
case INDEX_DROP_CLEAR_VALID:
/*
@@ -3805,6 +4033,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
indexInfo->ii_ExclusionStrats = NULL;
}
+ /* Auxiliary indexes are not allowed to be rebuilt */
+ if (indexInfo->ii_Auxiliary)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("reindex of auxiliary index \"%s\" not supported",
+ RelationGetRelationName(iRel))));
+
/* Suppress use of the target index while rebuilding it */
SetReindexProcessing(heapId, indexId);
@@ -4047,6 +4282,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
{
Oid indexOid = lfirst_oid(indexId);
Oid indexNamespaceId = get_rel_namespace(indexOid);
+ Oid indexAm = get_rel_relam(indexOid);
/*
* Skip any invalid indexes on a TOAST table. These can only be
@@ -4072,6 +4308,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
continue;
}
+ if (indexAm == STIR_AM_OID)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+ get_namespace_name(indexNamespaceId),
+ get_rel_name(indexOid))));
+ if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+ RemoveReindexPending(indexOid);
+ continue;
+ }
+
reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
persistence, params);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f1ed7b58f13..0dfa46a9b74 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1379,16 +1379,17 @@ CREATE VIEW pg_stat_progress_create_index AS
END AS command,
CASE S.param10 WHEN 0 THEN 'initializing'
WHEN 1 THEN 'waiting for writers before build'
- WHEN 2 THEN 'building index' ||
+ WHEN 2 THEN 'waiting for writers to use auxiliary index'
+ WHEN 3 THEN 'building index' ||
COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
'')
- WHEN 3 THEN 'waiting for writers before validation'
- WHEN 4 THEN 'index validation: scanning index'
- WHEN 5 THEN 'index validation: sorting tuples'
- WHEN 6 THEN 'index validation: scanning table'
- WHEN 7 THEN 'waiting for old snapshots'
- WHEN 8 THEN 'waiting for readers before marking dead'
- WHEN 9 THEN 'waiting for readers before dropping'
+ WHEN 4 THEN 'waiting for writers before validation'
+ WHEN 5 THEN 'index validation: scanning index'
+ WHEN 6 THEN 'index validation: sorting tuples'
+ WHEN 7 THEN 'index validation: merging indexes'
+ WHEN 8 THEN 'waiting for old snapshots'
+ WHEN 9 THEN 'waiting for readers before marking dead'
+ WHEN 10 THEN 'waiting for readers before dropping'
END as phase,
S.param4 AS lockers_total,
S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cbd76066f74..dc4af0409df 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
bool isWithoutOverlaps)
{
bool isconstraint;
+ bool isauxiliary;
Oid *typeIds;
Oid *collationIds;
Oid *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
amcanorder = amRoutine->amcanorder;
amsummarizing = amRoutine->amsummarizing;
+ isauxiliary = accessMethodId == STIR_AM_OID;
/*
* Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
*/
indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
accessMethodId, NIL, NIL, false, false,
- false, false, amsummarizing, isWithoutOverlaps);
+ false, false, amsummarizing,
+ isWithoutOverlaps, isauxiliary);
typeIds = palloc_array(Oid, numberOfAttributes);
collationIds = palloc_array(Oid, numberOfAttributes);
opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -557,6 +560,7 @@ DefineIndex(ParseState *pstate,
{
bool concurrent;
char *indexRelationName;
+ char *auxIndexRelationName = NULL;
char *accessMethodName;
Oid *typeIds;
Oid *collationIds;
@@ -566,6 +570,7 @@ DefineIndex(ParseState *pstate,
Oid namespaceId;
Oid tablespaceId;
Oid createdConstraintId = InvalidOid;
+ Oid auxIndexRelationId = InvalidOid;
List *indexColNames;
List *allIndexParams;
Relation rel;
@@ -587,6 +592,7 @@ DefineIndex(ParseState *pstate,
int numberOfKeyAttributes;
TransactionId limitXmin;
ObjectAddress address;
+ ObjectAddress auxAddress;
LockRelId heaprelid;
LOCKTAG heaplocktag;
LOCKMODE lockmode;
@@ -837,6 +843,15 @@ DefineIndex(ParseState *pstate,
stmt->excludeOpNames,
stmt->primary,
stmt->isconstraint);
+ /*
+ * Select name for auxiliary index
+ */
+ if (concurrent)
+ auxIndexRelationName = ChooseRelationName(indexRelationName,
+ NULL,
+ "ccaux",
+ namespaceId,
+ false);
/*
* look up the access method, verify it can handle the requested features
@@ -931,7 +946,8 @@ DefineIndex(ParseState *pstate,
!concurrent,
concurrent,
amissummarizing,
- stmt->iswithoutoverlaps);
+ stmt->iswithoutoverlaps,
+ false);
typeIds = palloc_array(Oid, numberOfAttributes);
collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1601,6 +1617,16 @@ DefineIndex(ParseState *pstate,
return address;
}
+ /*
+ * In case of concurrent build - create auxiliary index record.
+ */
+ if (concurrent)
+ {
+ auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+ tablespaceId, auxIndexRelationName);
+ ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+ }
+
AtEOXact_GUC(false, root_save_nestlevel);
SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
@@ -1629,11 +1655,11 @@ DefineIndex(ParseState *pstate,
/*
* For a concurrent build, it's important to make the catalog entries
* visible to other transactions before we start to build the index. That
- * will prevent them from making incompatible HOT updates. The new index
- * will be marked not indisready and not indisvalid, so that no one else
- * tries to either insert into it or use it for queries.
+ * will prevent them from making incompatible HOT updates. New indexes
+ * (main and auxiliary) will be marked not indisready and not indisvalid,
+ * so that no one else tries to either insert into it or use it for queries.
*
- * We must commit our current transaction so that the index becomes
+ * We must commit our current transaction so that the indexes becomes
* visible; then start another. Note that all the data structures we just
* built are lost in the commit. The only data we keep past here are the
* relation IDs.
@@ -1643,7 +1669,7 @@ DefineIndex(ParseState *pstate,
* cannot block, even if someone else is waiting for access, because we
* already have the same lock within our transaction.
*
- * Note: we don't currently bother with a session lock on the index,
+ * Note: we don't currently bother with a session lock on the indexes,
* because there are no operations that could change its state while we
* hold lock on the parent table. This might need to change later.
*/
@@ -1682,7 +1708,7 @@ DefineIndex(ParseState *pstate,
* with the old list of indexes. Use ShareLock to consider running
* transactions that hold locks that permit writing to the table. Note we
* do not need to worry about xacts that open the table for writing after
- * this point; they will see the new index when they open it.
+ * this point; they will see the new indexes when they open it.
*
* Note: the reason we use actual lock acquisition here, rather than just
* checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1694,14 +1720,44 @@ DefineIndex(ParseState *pstate,
/*
* At this moment we are sure that there are no transactions with the
- * table open for write that don't have this new index in their list of
+ * table open for write that don't have this new indexes in their list of
* indexes. We have waited out all the existing transactions and any new
- * transaction will have the new index in its list, but the index is still
- * marked as "not-ready-for-inserts". The index is consulted while
+ * transaction will have both new indexes in its list, but indexes are still
+ * marked as "not-ready-for-inserts". The indexes are consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
+ * Now call build on auxiliary index. Index will be created empty without
+ * any actual heap scan, but marked as "ready-for-inserts". The goal of
+ * that index is accumulate new tuples while main index is actually built.
+ */
+
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ PushActiveSnapshot(GetTransactionSnapshot());
+
+ index_concurrently_build(tableId, auxIndexRelationId);
+ /* we can do away with our snapshot */
+ PopActiveSnapshot();
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /* Tell concurrent index builds to ignore us, if index qualifies */
+ if (safe_index)
+ set_indexsafe_procflags();
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ /*
+ * Now we need to ensure are no transactions with the with auxiliary index
+ * marked as "not-ready-for-inserts".
+ */
+ WaitForLockers(heaplocktag, ShareLock, true);
+
+ /*
+ * At this moment we are sure that all new tuples in table are inserted into
+ * the auxiliary index. Now it is time to build the target index itself.
+ *
* We now take a new snapshot, and build the index using all tuples that
* are visible in this snapshot. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
@@ -1736,9 +1792,28 @@ DefineIndex(ParseState *pstate,
* the index marked as read-only for updates.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForLockers(heaplocktag, ShareLock, true);
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+ /*
+ * Now target index is marked as "ready" for all transactions. So, auxiliary
+ * index is no longer needed. So, start removing process by reverting "ready"
+ * flag.
+ */
+ index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+ PopActiveSnapshot();
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ /* Tell concurrent index builds to ignore us, if index qualifies */
+ if (safe_index)
+ set_indexsafe_procflags();
+
/*
* Now take the "reference snapshot" that will be used by validate_index()
* to filter candidate tuples. Beware! There might still be snapshots in
@@ -1756,24 +1831,14 @@ DefineIndex(ParseState *pstate,
*/
snapshot = RegisterSnapshot(GetTransactionSnapshot());
PushActiveSnapshot(snapshot);
-
/*
- * Scan the index and the heap, insert any missing index entries.
- */
- validate_index(tableId, indexRelationId, snapshot);
-
- /*
- * Drop the reference snapshot. We must do this before waiting out other
- * snapshot holders, else we will deadlock against other processes also
- * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
- * they must wait for. But first, save the snapshot's xmin to use as
- * limitXmin for GetCurrentVirtualXIDs().
+ * Merge content of auxiliary and target indexes - insert any missing index entries.
*/
+ validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
limitXmin = snapshot->xmin;
PopActiveSnapshot();
UnregisterSnapshot(snapshot);
-
/*
* The snapshot subsystem could still contain registered snapshots that
* are holding back our process's advertised xmin; in particular, if
@@ -1800,7 +1865,7 @@ DefineIndex(ParseState *pstate,
*/
INJECTION_POINT("define-index-before-set-valid", NULL);
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForOlderSnapshots(limitXmin, true);
/*
@@ -1825,6 +1890,53 @@ DefineIndex(ParseState *pstate,
* to replan; so relcache flush on the index itself was sufficient.)
*/
CacheInvalidateRelcacheByRelid(heaprelid.relId);
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /* Tell concurrent index builds to ignore us, if index qualifies */
+ if (safe_index)
+ set_indexsafe_procflags();
+
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_5);
+ /* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
+ WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+ /* Now it is time to mark auxiliary index as dead */
+ index_concurrently_set_dead(tableId, auxIndexRelationId);
+ PopActiveSnapshot();
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /* Tell concurrent index builds to ignore us, if index qualifies */
+ if (safe_index)
+ set_indexsafe_procflags();
+
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_6);
+ /* Now wait for all transaction to ignore auxiliary because it is dead */
+ WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+ CommitTransactionCommand();
+ StartTransactionCommand();
+
+ /*
+ * Drop auxiliary index.
+ *
+ * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+ * right lock level.
+ */
+ performDeletion(&auxAddress, DROP_RESTRICT,
+ PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
/*
* Last thing to do is release the session-level lock on the parent table.
@@ -3596,6 +3708,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
typedef struct ReindexIndexInfo
{
Oid indexId;
+ Oid auxIndexId;
Oid tableId;
Oid amId;
bool safe; /* for set_indexsafe_procflags */
@@ -3701,8 +3814,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
Oid cellOid = lfirst_oid(lc);
Relation indexRelation = index_open(cellOid,
ShareUpdateExclusiveLock);
+ IndexInfo* indexInfo = BuildDummyIndexInfo(indexRelation);
- if (!indexRelation->rd_index->indisvalid)
+
+ if (indexInfo->ii_Auxiliary)
+ ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+ get_namespace_name(get_rel_namespace(cellOid)),
+ get_rel_name(cellOid))));
+ else if (!indexRelation->rd_index->indisvalid)
ereport(WARNING,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3754,8 +3874,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
Oid cellOid = lfirst_oid(lc2);
Relation indexRelation = index_open(cellOid,
ShareUpdateExclusiveLock);
+ IndexInfo* indexInfo = BuildDummyIndexInfo(indexRelation);
- if (!indexRelation->rd_index->indisvalid)
+ if (indexInfo->ii_Auxiliary)
+ ereport(WARNING,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+ get_namespace_name(get_rel_namespace(cellOid)),
+ get_rel_name(cellOid))));
+ else if (!indexRelation->rd_index->indisvalid)
ereport(WARNING,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3816,6 +3943,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot reindex invalid index on TOAST table")));
+ /* Auxiliary indexes are not allowed to be rebuilt */
+ if (get_rel_relam(relationOid) == STIR_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("reindex of auxiliary index \"%s\" not supported",
+ get_rel_name(relationOid))));
+
/*
* Check if parent relation can be locked and if it exists,
* this needs to be done at this stage as the list of indexes
@@ -3919,15 +4053,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
foreach(lc, indexIds)
{
char *concurrentName;
+ char *auxConcurrentName;
ReindexIndexInfo *idx = lfirst(lc);
ReindexIndexInfo *newidx;
Oid newIndexId;
+ Oid auxIndexId;
Relation indexRel;
Relation heapRel;
Oid save_userid;
int save_sec_context;
int save_nestlevel;
Relation newIndexRel;
+ Relation auxIndexRel;
LockRelId *lockrelid;
Oid tablespaceid;
@@ -3978,6 +4115,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
"ccnew",
get_rel_namespace(indexRel->rd_index->indrelid),
false);
+ auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+ NULL,
+ "ccaux",
+ get_rel_namespace(indexRel->rd_index->indrelid),
+ false);
/* Choose the new tablespace, indexes of toast tables are not moved */
if (OidIsValid(params->tablespaceOid) &&
@@ -3991,12 +4133,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
idx->indexId,
tablespaceid,
concurrentName);
+ auxIndexId = index_concurrently_create_aux(heapRel,
+ newIndexId,
+ tablespaceid,
+ auxConcurrentName);
/*
* Now open the relation of the new index, a session-level lock is
* also needed on it.
*/
newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+ auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
/*
* Save the list of OIDs and locks in private context
@@ -4005,6 +4152,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
newidx = palloc_object(ReindexIndexInfo);
newidx->indexId = newIndexId;
+ newidx->auxIndexId = auxIndexId;
newidx->safe = idx->safe;
newidx->tableId = idx->tableId;
newidx->amId = idx->amId;
@@ -4023,10 +4171,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
lockrelid = palloc_object(LockRelId);
*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
relationLocks = lappend(relationLocks, lockrelid);
+ lockrelid = palloc_object(LockRelId);
+ *lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+ relationLocks = lappend(relationLocks, lockrelid);
MemoryContextSwitchTo(oldcontext);
index_close(indexRel, NoLock);
+ index_close(auxIndexRel, NoLock);
index_close(newIndexRel, NoLock);
/* Roll back any GUC changes executed by index functions */
@@ -4107,13 +4259,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* doing that, wait until no running transactions could have the table of
* the index open with the old list of indexes. See "phase 2" in
* DefineIndex() for more details.
+ */
+
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_1);
+ WaitForLockersMultiple(lockTags, ShareLock, true);
+ CommitTransactionCommand();
+
+ /*
+ * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+ */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+
+ StartTransactionCommand();
+
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Tell concurrent indexing to ignore us, if index qualifies */
+ if (newidx->safe)
+ set_indexsafe_procflags();
+
+ /* Set ActiveSnapshot since functions in the indexes may need it */
+ PushActiveSnapshot(GetTransactionSnapshot());
+
+ /* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
+ index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ }
+
+ StartTransactionCommand();
+
+ /*
+ * Because we don't take a snapshot in this transaction, there's no need
+ * to set the PROC_IN_SAFE_IC flag here.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_1);
+ PROGRESS_CREATEIDX_PHASE_WAIT_2);
+ /*
+ * Wait until all auxiliary indexes are taken into account by all
+ * transactions.
+ */
WaitForLockersMultiple(lockTags, ShareLock, true);
CommitTransactionCommand();
+ /* Now it is time to perform target index build. */
foreach(lc, newIndexIds)
{
ReindexIndexInfo *newidx = lfirst(lc);
@@ -4160,6 +4359,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* need to set the PROC_IN_SAFE_IC flag here.
*/
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ WaitForLockersMultiple(lockTags, ShareLock, true);
+ CommitTransactionCommand();
+
+ /*
+ * At this moment all target indexes are marked as "ready-to-insert". So,
+ * we are free to start process of dropping auxiliary indexes.
+ */
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+ StartTransactionCommand();
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Tell concurrent indexing to ignore us, if index qualifies */
+ if (newidx->safe)
+ set_indexsafe_procflags();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+ index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+ PopActiveSnapshot();
+
+ CommitTransactionCommand();
+ }
+
/*
* Phase 3 of REINDEX CONCURRENTLY
*
@@ -4167,12 +4401,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* were created during the previous phase. See "phase 3" in DefineIndex()
* for more details.
*/
-
- pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_2);
- WaitForLockersMultiple(lockTags, ShareLock, true);
- CommitTransactionCommand();
-
foreach(lc, newIndexIds)
{
ReindexIndexInfo *newidx = lfirst(lc);
@@ -4210,7 +4438,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
progress_vals[3] = newidx->amId;
pgstat_progress_update_multi_param(4, progress_index, progress_vals);
- validate_index(newidx->tableId, newidx->indexId, snapshot);
+ validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
/*
* We can now do away with our active snapshot, we still need to save
@@ -4239,7 +4467,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
* there's no need to set the PROC_IN_SAFE_IC flag here.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ PROGRESS_CREATEIDX_PHASE_WAIT_4);
WaitForOlderSnapshots(limitXmin, true);
CommitTransactionCommand();
@@ -4330,14 +4558,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* Phase 5 of REINDEX CONCURRENTLY
*
- * Mark the old indexes as dead. First we must wait until no running
- * transaction could be using the index for a query. See also
+ * Mark the old and auxiliary indexes as dead. First we must wait until no
+ * running transaction could be using the index for a query. See also
* index_drop() for more details.
*/
INJECTION_POINT("reindex-relation-concurrently-before-set-dead", NULL);
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_4);
+ PROGRESS_CREATEIDX_PHASE_WAIT_5);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
foreach(lc, indexIds)
@@ -4362,6 +4590,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
PopActiveSnapshot();
}
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *newidx = lfirst(lc);
+
+ /*
+ * Check for user-requested abort. This is inside a transaction so as
+ * xact.c does not issue a useless WARNING, and ensures that
+ * session-level locks are cleaned up on abort.
+ */
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Updating pg_index might involve TOAST table access, so ensure we
+ * have a valid snapshot.
+ */
+ PushActiveSnapshot(GetTransactionSnapshot());
+
+ index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+ PopActiveSnapshot();
+ }
+
/* Commit this transaction to make the updates visible. */
CommitTransactionCommand();
StartTransactionCommand();
@@ -4375,11 +4625,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* Phase 6 of REINDEX CONCURRENTLY
*
- * Drop the old indexes.
+ * Drop the old and auxiliary indexes.
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
- PROGRESS_CREATEIDX_PHASE_WAIT_5);
+ PROGRESS_CREATEIDX_PHASE_WAIT_6);
WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
PushActiveSnapshot(GetTransactionSnapshot());
@@ -4399,6 +4649,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
add_exact_object_address(&object, objects);
}
+ foreach(lc, newIndexIds)
+ {
+ ReindexIndexInfo *idx = lfirst(lc);
+ ObjectAddress object;
+
+ object.classId = RelationRelationId;
+ object.objectId = idx->auxIndexId;
+ object.objectSubId = 0;
+
+ add_exact_object_address(&object, objects);
+ }
+
/*
* Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
* right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 5359dab1176..84f7cf9824e 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
List *predicates, bool unique, bool nulls_not_distinct,
bool isready, bool concurrent, bool summarizing,
- bool withoutoverlaps)
+ bool withoutoverlaps, bool auxiliary)
{
IndexInfo *n = makeNode(IndexInfo);
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
n->ii_Concurrent = concurrent;
n->ii_Summarizing = summarizing;
n->ii_WithoutOverlaps = withoutoverlaps;
+ n->ii_Auxiliary = auxiliary;
/* summarizing indexes cannot contain non-key attributes */
Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
/* initialize index-build state to default */
n->ii_BrokenHotChain = false;
n->ii_ParallelWorkers = 0;
- n->ii_Auxiliary = false;
/* set up for possible use by index AM */
n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..1a997537800 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -705,7 +705,8 @@ typedef struct TableAmRoutine
Relation index_rel,
IndexInfo *index_info,
Snapshot snapshot,
- ValidateIndexState *state);
+ ValidateIndexState *state,
+ ValidateIndexState *aux_state);
/* ------------------------------------------------------------------------
@@ -1824,19 +1825,24 @@ table_index_build_range_scan(Relation table_rel,
* table_index_validate_scan - second table scan for concurrent index build
*
* See validate_index() for an explanation.
+ *
+ * Note: it is responsibility of that function to close sortstates in
+ * both `state` and `auxstate`.
*/
static inline void
table_index_validate_scan(Relation table_rel,
Relation index_rel,
IndexInfo *index_info,
Snapshot snapshot,
- ValidateIndexState *state)
+ ValidateIndexState *state,
+ ValidateIndexState *auxstate)
{
table_rel->rd_tableam->index_validate_scan(table_rel,
index_rel,
index_info,
snapshot,
- state);
+ state,
+ auxstate);
}
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 36b70689254..727993d1a5a 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -31,6 +31,7 @@ typedef enum
{
INDEX_CREATE_SET_READY,
INDEX_CREATE_SET_VALID,
+ INDEX_DROP_CLEAR_READY,
INDEX_DROP_CLEAR_VALID,
INDEX_DROP_SET_DEAD,
} IndexStateFlagsAction;
@@ -71,6 +72,7 @@ extern void index_check_primary_key(Relation heapRel,
#define INDEX_CREATE_IF_NOT_EXISTS (1 << 4)
#define INDEX_CREATE_PARTITIONED (1 << 5)
#define INDEX_CREATE_INVALID (1 << 6)
+#define INDEX_CREATE_AUXILIARY (1 << 7)
extern Oid index_create(Relation heapRelation,
const char *indexRelationName,
@@ -106,6 +108,11 @@ extern Oid index_concurrently_create_copy(Relation heapRelation,
Oid tablespaceOid,
const char *newName);
+extern Oid index_concurrently_create_aux(Relation heapRelation,
+ Oid mainIndexId,
+ Oid tablespaceOid,
+ const char *newName);
+
extern void index_concurrently_build(Oid heapRelationId,
Oid indexRelationId);
@@ -151,7 +158,7 @@ extern void index_build(Relation heapRelation,
bool isreindex,
bool parallel);
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 9c40772706c..8e5f98c6fad 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -117,14 +117,15 @@
/* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
#define PROGRESS_CREATEIDX_PHASE_WAIT_1 1
-#define PROGRESS_CREATEIDX_PHASE_BUILD 2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2 3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN 4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT 5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN 6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3 7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2 2
+#define PROGRESS_CREATEIDX_PHASE_BUILD 3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3 4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN 5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT 6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE 7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4 8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5 9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6 10
/*
* Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index bf54d39feb0..cd7f1eb0592 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
List *expressions, List *predicates,
bool unique, bool nulls_not_distinct,
bool isready, bool concurrent,
- bool summarizing, bool withoutoverlaps);
+ bool summarizing, bool withoutoverlaps,
+ bool auxiliary);
extern Node *makeStringConst(char *str, int location);
extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 55538c4c41e..d1723f47e89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1437,6 +1437,7 @@ DETAIL: Key (f1)=(b) already exists.
CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
ERROR: could not create unique index "concur_index3"
DETAIL: Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
-- test that expression indexes and partial indexes work concurrently
CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3211,6 +3212,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
ERROR: could not create unique index "concur_reindex_ind5"
DETAIL: Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
-- Reindexing concurrently this index fails with the same failure.
-- The extra index created is itself invalid, and can be dropped.
REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3223,8 +3225,10 @@ DETAIL: Key (c1)=(1) is duplicated.
c1 | integer | | |
Indexes:
"concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+ "concur_reindex_ind5_ccaux" stir (c1) INVALID
"concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
+DROP INDEX concur_reindex_ind5_ccaux;
DROP INDEX concur_reindex_ind5_ccnew;
-- This makes the previous failure go away, so the index can become valid.
DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3252,6 +3256,44 @@ Indexes:
"concur_reindex_ind5" UNIQUE, btree (c1)
DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR: could not create unique index "aux_index_ind6"
+DETAIL: Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+ Table "public.aux_index_tab5"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "aux_index_ind6" UNIQUE, btree (c1) INVALID
+ "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR: reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR: reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR: relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR: could not create unique index "aux_index_ind6"
+DETAIL: Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING: skipping reindex of invalid index "public.aux_index_ind6"
+HINT: Use DROP INDEX or REINDEX INDEX.
+WARNING: skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE: table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
-- Check handling of indexes with expressions and predicates. The
-- definitions of the rebuilt indexes should match the original
-- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index dc629928c8f..9b06ddc87a2 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
--------------------------------+------------+-----------------------+-------------------------------
parted_isvalid_idx | f | parted_isvalid_tab |
parted_isvalid_idx_11 | f | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux | f | parted_isvalid_tab_11 |
parted_isvalid_tab_12_expr_idx | t | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
parted_isvalid_tab_1_expr_idx | f | parted_isvalid_tab_1 | parted_isvalid_idx
parted_isvalid_tab_2_expr_idx | t | parted_isvalid_tab_2 | parted_isvalid_idx
-(5 rows)
+(6 rows)
drop table parted_isvalid_tab;
-- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 32bea58db2c..b80d5c2ed65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2058,14 +2058,15 @@ pg_stat_progress_create_index| SELECT s.pid,
CASE s.param10
WHEN 0 THEN 'initializing'::text
WHEN 1 THEN 'waiting for writers before build'::text
- WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
- WHEN 3 THEN 'waiting for writers before validation'::text
- WHEN 4 THEN 'index validation: scanning index'::text
- WHEN 5 THEN 'index validation: sorting tuples'::text
- WHEN 6 THEN 'index validation: scanning table'::text
- WHEN 7 THEN 'waiting for old snapshots'::text
- WHEN 8 THEN 'waiting for readers before marking dead'::text
- WHEN 9 THEN 'waiting for readers before dropping'::text
+ WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+ WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+ WHEN 4 THEN 'waiting for writers before validation'::text
+ WHEN 5 THEN 'index validation: scanning index'::text
+ WHEN 6 THEN 'index validation: sorting tuples'::text
+ WHEN 7 THEN 'index validation: merging indexes'::text
+ WHEN 8 THEN 'waiting for old snapshots'::text
+ WHEN 9 THEN 'waiting for readers before marking dead'::text
+ WHEN 10 THEN 'waiting for readers before dropping'::text
ELSE NULL::text
END AS phase,
s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 82e4062a215..c2c1b031527 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -503,6 +503,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
INSERT INTO concur_heap VALUES ('b','x');
-- check if constraint is enforced properly at build time
CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
-- test that expression indexes and partial indexes work concurrently
CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1315,10 +1316,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
-- This trick creates an invalid index.
CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
-- Reindexing concurrently this index fails with the same failure.
-- The extra index created is itself invalid, and can be dropped.
REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
\d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
DROP INDEX concur_reindex_ind5_ccnew;
-- This makes the previous failure go away, so the index can become valid.
DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1330,6 +1333,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
\d concur_reindex_tab4
DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
-- Check handling of indexes with expressions and predicates. The
-- definitions of the rebuilt indexes should match the original
-- definitions.
--
2.43.0
[application/octet-stream] v31-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.9K, 5-v31-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
download | inline diff:
From 40c6d37815c25c63d3ec1e0b4e119e193795fa02 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v31 5/7] Track and drop auxiliary indexes in DROP/REINDEX
During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).
This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
doc/src/sgml/ref/create_index.sgml | 14 ++-
doc/src/sgml/ref/reindex.sgml | 8 +-
src/backend/catalog/dependency.c | 2 +-
src/backend/catalog/index.c | 71 ++++++++++----
src/backend/catalog/pg_depend.c | 58 ++++++++++++
src/backend/catalog/toasting.c | 1 +
src/backend/commands/indexcmds.c | 37 +++++++-
src/backend/commands/tablecmds.c | 52 +++++++++-
src/backend/nodes/makefuncs.c | 3 +-
src/include/catalog/dependency.h | 1 +
src/include/nodes/execnodes.h | 2 +
src/include/nodes/makefuncs.h | 2 +-
src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
src/test/regress/sql/create_index.sql | 57 ++++++++++-
14 files changed, 371 insertions(+), 42 deletions(-)
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 12c88587a79..7f751453317 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
"idx_ccaux" stir (col) INVALID
</programlisting>
- The recommended recovery
- method in such cases is to drop these indexes and try again to perform
- <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
- to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+ The recommended recovery method in such cases is to drop the index with
+ <command>DROP INDEX</command>. The auxiliary index (suffixed with
+ <literal>_ccaux</literal>) will be automatically dropped when the main
+ index is dropped. After dropping the indexes, you can try again to perform
+ <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+ rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+ which will also handle cleanup of any invalid auxiliary indexes.)
+ If the only invalid index is one suffixed <literal>_ccaux</literal>,
+ recommended recovery method is just <literal>DROP INDEX</literal>
+ for that index.
</para>
<para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 9e0248261ae..54f7b36efa2 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -476,11 +476,15 @@ Indexes:
<literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
index created during the concurrent operation, and the recommended
recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
- then attempt <command>REINDEX CONCURRENTLY</command> again.
+ then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+ (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+ along with its main index.
If the invalid index is instead suffixed <literal>_ccold</literal>,
it corresponds to the original index which could not be dropped;
the recommended recovery method is to just drop said index, since the
- rebuild proper has been successful.
+ rebuild proper has been successful. If the only
+ invalid index is one suffixed <literal>_ccaux</literal>, recommended
+ recovery method is just <literal>DROP INDEX</literal> for that index.
A nonzero number may be appended to the suffix of the invalid index
names to keep them unique, like <literal>_ccnew1</literal>,
<literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index fdb8e67e1f5..c6941fb19d1 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -292,7 +292,7 @@ performDeletion(const ObjectAddress *object,
* Acquire deletion lock on the target object. (Ideally the caller has
* done this already, but many places are sloppy about it.)
*/
- AcquireDeletionLock(object, 0);
+ AcquireDeletionLock(object, flags);
/*
* Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 31f92b97580..4b6a0f76c81 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
/* partitioned indexes must never be "built" by themselves */
Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+ /* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+ Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
}
+ /*
+ * Record dependency on the main index in case of auxiliary index.
+ */
+ if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+ {
+ ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+ recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+ }
+
/* placeholder for normal dependencies */
addrs = new_object_addresses();
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
true,
indexRelation->rd_indam->amsummarizing,
oldInfo->ii_WithoutOverlaps,
- false);
+ false,
+ InvalidOid);
/*
* Extract the list of column names and the column numbers for the new
@@ -1581,7 +1593,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
true,
false, /* aux are not summarizing */
false, /* aux are not without overlaps */
- true /* auxiliary */);
+ true /* auxiliary */,
+ mainIndexId /* auxiliaryForIndexId */);
/*
* Extract the list of column names and the column numbers for the new
@@ -2617,7 +2630,8 @@ BuildIndexInfo(Relation index)
false,
index->rd_indam->amsummarizing,
indexStruct->indisexclusion && indexStruct->indisunique,
- index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+ index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+ InvalidOid /* auxiliary_for_index_id is set only during build */);
/* fill in attribute numbers */
for (i = 0; i < numAtts; i++)
@@ -2678,7 +2692,8 @@ BuildDummyIndexInfo(Relation index)
false,
index->rd_indam->amsummarizing,
indexStruct->indisexclusion && indexStruct->indisunique,
- index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+ index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+ InvalidOid);
/* fill in attribute numbers */
for (i = 0; i < numAtts; i++)
@@ -3843,6 +3858,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
heapRelation;
Oid heapId;
Oid save_userid;
+ Oid junkAuxIndexId;
int save_sec_context;
int save_nestlevel;
IndexInfo *indexInfo;
@@ -3899,6 +3915,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
}
+ /* Check for the auxiliary index for that index, it needs to be dropped */
+ junkAuxIndexId = get_auxiliary_index(indexId);
+ if (OidIsValid(junkAuxIndexId))
+ {
+ ObjectAddress object;
+ object.classId = RelationRelationId;
+ object.objectId = junkAuxIndexId;
+ object.objectSubId = 0;
+ performDeletion(&object, DROP_RESTRICT,
+ PERFORM_DELETION_INTERNAL |
+ PERFORM_DELETION_QUIETLY);
+ }
+
/*
* Open the target index relation and get an exclusive lock on it, to
* ensure that no one else is touching this particular index.
@@ -4187,7 +4216,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
{
Relation rel;
Oid toast_relid;
- List *indexIds;
+ List *indexIds,
+ *auxIndexIds = NIL;
char persistence;
bool result = false;
ListCell *indexId;
@@ -4276,13 +4306,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
else
persistence = rel->rd_rel->relpersistence;
+ foreach(indexId, indexIds)
+ {
+ Oid indexOid = lfirst_oid(indexId);
+ Oid indexAm = get_rel_relam(indexOid);
+
+ /* All STIR indexes are auxiliary indexes */
+ if (indexAm == STIR_AM_OID)
+ {
+ if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+ RemoveReindexPending(indexOid);
+ auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+ }
+ }
+
/* Reindex all the indexes. */
i = 1;
foreach(indexId, indexIds)
{
Oid indexOid = lfirst_oid(indexId);
Oid indexNamespaceId = get_rel_namespace(indexOid);
- Oid indexAm = get_rel_relam(indexOid);
+
+ /* Auxiliary indexes are going to be dropped during main index rebuild */
+ if (list_member_oid(auxIndexIds, indexOid))
+ continue;
/*
* Skip any invalid indexes on a TOAST table. These can only be
@@ -4308,18 +4355,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
continue;
}
- if (indexAm == STIR_AM_OID)
- {
- ereport(WARNING,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("skipping reindex of auxiliary index \"%s.%s\"",
- get_namespace_name(indexNamespaceId),
- get_rel_name(indexOid))));
- if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
- RemoveReindexPending(indexOid);
- continue;
- }
-
reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
persistence, params);
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 07c2d41c189..7e0e29bdb5b 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -20,6 +20,7 @@
#include "catalog/catalog.h"
#include "catalog/dependency.h"
#include "catalog/indexing.h"
+#include "catalog/pg_am_d.h"
#include "catalog/pg_constraint.h"
#include "catalog/pg_depend.h"
#include "catalog/pg_extension.h"
@@ -1108,6 +1109,63 @@ get_index_constraint(Oid indexId)
return constraintId;
}
+/*
+ * get_auxiliary_index
+ * Given the OID of an index, return the OID of its auxiliary
+ * index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+ Oid auxiliaryIndexOid = InvalidOid;
+ Relation depRel;
+ ScanKeyData key[3];
+ SysScanDesc scan;
+ HeapTuple tup;
+
+ /* Search the dependency table for the index */
+ depRel = table_open(DependRelationId, AccessShareLock);
+
+ ScanKeyInit(&key[0],
+ Anum_pg_depend_refclassid,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(RelationRelationId));
+ ScanKeyInit(&key[1],
+ Anum_pg_depend_refobjid,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(indexId));
+ ScanKeyInit(&key[2],
+ Anum_pg_depend_refobjsubid,
+ BTEqualStrategyNumber, F_INT4EQ,
+ Int32GetDatum(0));
+
+ scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+ NULL, 3, key);
+
+ while (HeapTupleIsValid(tup = systable_getnext(scan)))
+ {
+ Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+ /*
+ * We assume AUTO dependency on index with rel_kind
+ * of RELKIND_INDEX and AM eq STIR is that we are looking for.
+ */
+ if (deprec->classid == RelationRelationId &&
+ (deprec->deptype == DEPENDENCY_AUTO) &&
+ get_rel_relkind(deprec->objid) == RELKIND_INDEX &&
+ get_rel_relam(deprec->objid) == STIR_AM_OID)
+ {
+ auxiliaryIndexOid = deprec->objid;
+ break;
+ }
+ }
+
+ systable_endscan(scan);
+ table_close(depRel, AccessShareLock);
+
+ return auxiliaryIndexOid;
+}
+
/*
* get_index_ref_constraints
* Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index c33e43df1ec..b16eac0357f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -314,6 +314,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
indexInfo->ii_Am = BTREE_AM_OID;
indexInfo->ii_AmCache = NULL;
indexInfo->ii_Auxiliary = false;
+ indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
indexInfo->ii_Context = CurrentMemoryContext;
collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index dc4af0409df..b430d4a5b34 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -247,7 +247,7 @@ CheckIndexCompatible(Oid oldId,
indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
accessMethodId, NIL, NIL, false, false,
false, false, amsummarizing,
- isWithoutOverlaps, isauxiliary);
+ isWithoutOverlaps, isauxiliary, InvalidOid);
typeIds = palloc_array(Oid, numberOfAttributes);
collationIds = palloc_array(Oid, numberOfAttributes);
opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -947,7 +947,8 @@ DefineIndex(ParseState *pstate,
concurrent,
amissummarizing,
stmt->iswithoutoverlaps,
- false);
+ false,
+ InvalidOid);
typeIds = palloc_array(Oid, numberOfAttributes);
collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3709,6 +3710,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
{
Oid indexId;
Oid auxIndexId;
+ Oid junkAuxIndexId;
Oid tableId;
Oid amId;
bool safe; /* for set_indexsafe_procflags */
@@ -4058,6 +4060,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
ReindexIndexInfo *newidx;
Oid newIndexId;
Oid auxIndexId;
+ Oid junkAuxIndexId;
Relation indexRel;
Relation heapRel;
Oid save_userid;
@@ -4065,6 +4068,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
int save_nestlevel;
Relation newIndexRel;
Relation auxIndexRel;
+ Relation junkAuxIndexRel;
LockRelId *lockrelid;
Oid tablespaceid;
@@ -4138,12 +4142,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
tablespaceid,
auxConcurrentName);
+ /* Search for auxiliary index for reindexed index, to drop it */
+ junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
/*
* Now open the relation of the new index, a session-level lock is
* also needed on it.
*/
newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+ if (OidIsValid(junkAuxIndexId))
+ junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
/*
* Save the list of OIDs and locks in private context
@@ -4153,6 +4162,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
newidx = palloc_object(ReindexIndexInfo);
newidx->indexId = newIndexId;
newidx->auxIndexId = auxIndexId;
+ newidx->junkAuxIndexId = junkAuxIndexId;
newidx->safe = idx->safe;
newidx->tableId = idx->tableId;
newidx->amId = idx->amId;
@@ -4174,10 +4184,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
lockrelid = palloc_object(LockRelId);
*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
relationLocks = lappend(relationLocks, lockrelid);
+ if (OidIsValid(junkAuxIndexId))
+ {
+ lockrelid = palloc_object(LockRelId);
+ *lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+ relationLocks = lappend(relationLocks, lockrelid);
+ }
MemoryContextSwitchTo(oldcontext);
index_close(indexRel, NoLock);
+ if (OidIsValid(junkAuxIndexId))
+ index_close(junkAuxIndexRel, NoLock);
index_close(auxIndexRel, NoLock);
index_close(newIndexRel, NoLock);
@@ -4366,7 +4384,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* At this moment all target indexes are marked as "ready-to-insert". So,
- * we are free to start process of dropping auxiliary indexes.
+ * we are free to start process of dropping auxiliary indexes - including
+ * junk indexes detected earlier.
*/
foreach(lc, newIndexIds)
{
@@ -4389,6 +4408,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
*/
PushActiveSnapshot(GetTransactionSnapshot());
index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+ /* Ensure the junk index is marked as non-ready */
+ if (OidIsValid(newidx->junkAuxIndexId))
+ index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
PopActiveSnapshot();
CommitTransactionCommand();
@@ -4608,6 +4630,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
PushActiveSnapshot(GetTransactionSnapshot());
index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+ if (OidIsValid(newidx->junkAuxIndexId))
+ index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
PopActiveSnapshot();
}
@@ -4659,6 +4683,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
object.objectSubId = 0;
add_exact_object_address(&object, objects);
+
+ if (OidIsValid(idx->junkAuxIndexId))
+ {
+ object.objectId = idx->junkAuxIndexId;
+ object.objectSubId = 0;
+ add_exact_object_address(&object, objects);
+ }
}
/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 67e42e5df29..87aba245b85 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1567,6 +1567,8 @@ RemoveRelations(DropStmt *drop)
ListCell *cell;
int flags = 0;
LOCKMODE lockmode = AccessExclusiveLock;
+ MemoryContext private_context,
+ oldcontext;
/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
if (drop->concurrent)
@@ -1631,9 +1633,20 @@ RemoveRelations(DropStmt *drop)
relkind = 0; /* keep compiler quiet */
break;
}
+ /*
+ * Create a memory context that will survive forced transaction commits we
+ * may need to do below (in case of concurrent index drop).
+ * Since it is a child of PortalContext, it will go away eventually even if
+ * we suffer an error; there's no need for special abort cleanup logic.
+ */
+ private_context = AllocSetContextCreate(PortalContext,
+ "RemoveRelations",
+ ALLOCSET_SMALL_SIZES);
+ oldcontext = MemoryContextSwitchTo(private_context);
/* Lock and validate each relation; build a list of object addresses */
objects = new_object_addresses();
+ MemoryContextSwitchTo(oldcontext);
foreach(cell, drop->objects)
{
@@ -1685,6 +1698,38 @@ RemoveRelations(DropStmt *drop)
flags |= PERFORM_DELETION_CONCURRENTLY;
}
+ /*
+ * Concurrent index drop requires it to be the first transaction. But in
+ * case we have junk auxiliary index - we want to drop it too (and also
+ * in a concurrent way). In this case perform silent internal deletion
+ * of auxiliary index, and restore transaction state. It is fine to do it
+ * in the loop because there is only single element in drop->objects.
+ */
+ if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+ state.actual_relkind == RELKIND_INDEX)
+ {
+ Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+ if (OidIsValid(junkAuxIndexOid))
+ {
+ ObjectAddress object;
+ object.classId = RelationRelationId;
+ object.objectId = junkAuxIndexOid;
+ object.objectSubId = 0;
+ performDeletion(&object, DROP_RESTRICT,
+ PERFORM_DELETION_CONCURRENTLY |
+ PERFORM_DELETION_INTERNAL |
+ PERFORM_DELETION_QUIETLY);
+ CommitTransactionCommand();
+ MemoryContextDelete(private_context);
+
+ /* And start again - now without auxiliary index. */
+ StartTransactionCommand();
+ PushActiveSnapshot(GetTransactionSnapshot());
+ RemoveRelations(drop);
+ return;
+ }
+ }
+
/*
* Concurrent index drop cannot be used with partitioned indexes,
* either.
@@ -1713,12 +1758,17 @@ RemoveRelations(DropStmt *drop)
obj.objectId = relOid;
obj.objectSubId = 0;
+ oldcontext = MemoryContextSwitchTo(private_context);
add_exact_object_address(&obj, objects);
+ MemoryContextSwitchTo(oldcontext);
}
+ /* Deletion may involve multiple commits, so, switch to memory context */
+ oldcontext = MemoryContextSwitchTo(private_context);
performMultipleDeletions(objects, drop->behavior, flags);
+ MemoryContextSwitchTo(oldcontext);
- free_object_addresses(objects);
+ MemoryContextDelete(private_context);
}
/*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 84f7cf9824e..c54748ff644 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
List *predicates, bool unique, bool nulls_not_distinct,
bool isready, bool concurrent, bool summarizing,
- bool withoutoverlaps, bool auxiliary)
+ bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
{
IndexInfo *n = makeNode(IndexInfo);
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
n->ii_Summarizing = summarizing;
n->ii_WithoutOverlaps = withoutoverlaps;
n->ii_Auxiliary = auxiliary;
+ n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
/* summarizing indexes cannot contain non-key attributes */
Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 2f3c1eae3c7..6ae210c584e 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -193,6 +193,7 @@ extern List *getOwnedSequences(Oid relid);
extern Oid getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
extern Oid get_index_constraint(Oid indexId);
+extern Oid get_auxiliary_index(Oid indexId);
extern List *get_index_ref_constraints(Oid indexId);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0f834889912..f97fcb7872c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -229,6 +229,8 @@ typedef struct IndexInfo
int ii_ParallelWorkers;
/* is auxiliary for concurrent index build? */
bool ii_Auxiliary;
+ /* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+ Oid ii_AuxiliaryForIndexId;
/* Oid of index AM */
Oid ii_Am;
/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index cd7f1eb0592..3a704781c8b 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
bool unique, bool nulls_not_distinct,
bool isready, bool concurrent,
bool summarizing, bool withoutoverlaps,
- bool auxiliary);
+ bool auxiliary, Oid auxiliary_for_index_id);
extern Node *makeStringConst(char *str, int location);
extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index d1723f47e89..2d6abb15a89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3279,20 +3279,109 @@ ERROR: reindex of auxiliary index "aux_index_ind6_ccaux" not supported
REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
ERROR: reindex of auxiliary index "aux_index_ind6_ccaux" not supported
-- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR: relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
- ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+ Table "public.aux_index_tab5"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR: could not create unique index "aux_index_ind6"
+DETAIL: Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
-- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR: could not create unique index "aux_index_ind6"
-DETAIL: Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
REINDEX TABLE CONCURRENTLY aux_index_tab5;
WARNING: skipping reindex of invalid index "public.aux_index_ind6"
HINT: Use DROP INDEX or REINDEX INDEX.
WARNING: skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
NOTICE: table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+ Table "public.aux_index_tab5"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "aux_index_ind6" UNIQUE, btree (c1) INVALID
+ "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+ Table "public.aux_index_tab5"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR: could not create unique index "aux_index_ind6"
+DETAIL: Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+ Table "public.aux_index_tab5"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+Indexes:
+ "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR: could not create unique index "aux_index_ind6"
+DETAIL: Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+ Table "public.aux_index_tab5"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+
+DROP INDEX aux_index_ind6;
+ERROR: index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR: could not create unique index "aux_index_ind6"
+DETAIL: Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+ Table "public.aux_index_tab5"
+ Column | Type | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ c1 | integer | | |
+
DROP TABLE aux_index_tab5;
-- Check handling of indexes with expressions and predicates. The
-- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c2c1b031527..fd96d80abbc 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1344,11 +1344,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
-- Concurrently also
REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
-- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
-- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
DROP TABLE aux_index_tab5;
-- Check handling of indexes with expressions and predicates. The
--
2.43.0
[application/octet-stream] v31-0001-Add-stress-tests-for-concurrent-index-builds.patch (11.9K, 6-v31-0001-Add-stress-tests-for-concurrent-index-builds.patch)
download | inline diff:
From b6a36e045192906f72cb805f33f4cccafd780f89 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v31 1/7] Add stress tests for concurrent index builds
Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations
These tests verify the behavior of the following commits.
---
src/bin/pg_amcheck/meson.build | 1 +
src/bin/pg_amcheck/t/006_cic.pl | 273 ++++++++++++++++++++++++++++++++
2 files changed, 274 insertions(+)
create mode 100644 src/bin/pg_amcheck/t/006_cic.pl
diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 592cef74ecb..51a62dccb7b 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
't/003_check.pl',
't/004_verify_heapam.pl',
't/005_opclass_damage.pl',
+ 't/006_cic.pl',
],
},
}
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..0495ac10263
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,273 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use constant STRESS_PGBENCH_CLIENTS => 30;
+use constant STRESS_PGBENCH_JOBS => 8;
+use constant STRESS_PGBENCH_TRANSACTIONS => 10000;
+use constant STRESS_MAX_SLEEP_MS => 10;
+
+use constant DEFAULT_PGBENCH_CLIENTS => 15;
+use constant DEFAULT_PGBENCH_JOBS => 4;
+use constant DEFAULT_PGBENCH_TRANSACTIONS => 500;
+use constant DEFAULT_MAX_SLEEP_MS => 1;
+
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+my $pg_test_extra = $ENV{PG_TEST_EXTRA} // '';
+my $is_stress = $pg_test_extra =~ /\bstress\b/ ? 1 : 0;
+my $pgbench_clients =
+ $is_stress ? STRESS_PGBENCH_CLIENTS : DEFAULT_PGBENCH_CLIENTS;
+my $pgbench_jobs = $is_stress ? STRESS_PGBENCH_JOBS : DEFAULT_PGBENCH_JOBS;
+my $pgbench_transactions =
+ $is_stress ? STRESS_PGBENCH_TRANSACTIONS : DEFAULT_PGBENCH_TRANSACTIONS;
+my $max_sleep_ms = $is_stress ? STRESS_MAX_SLEEP_MS : DEFAULT_MAX_SLEEP_MS;
+my $pgbench_options = sprintf(
+ '--no-vacuum --client=%d --jobs=%d --exit-on-abort --transactions=%d',
+ $pgbench_clients,
+ $pgbench_jobs,
+ $pgbench_transactions);
+my $no_hot = $is_stress ? int(rand(2)) : 0;
+
+print(
+ sprintf(
+ 'settings: PG_TEST_EXTRA=%s stress=%d clients=%d jobs=%d transactions=%d max_sleep_ms=%d no_hot=%d',
+ defined($ENV{PG_TEST_EXTRA})
+ ? ($pg_test_extra eq '' ? '(empty)' : $pg_test_extra)
+ : '(undef)',
+ $is_stress,
+ $pgbench_clients,
+ $pgbench_jobs,
+ $pgbench_transactions,
+ $max_sleep_ms,
+ $no_hot));
+print "\n";
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+ c1 money default 0, c2 money default 0,
+ c3 money default 0, updated_at timestamp,
+ ia int4[], p point)));
+
+if ($no_hot) { $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);)); }
+
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ EXECUTE 'SELECT txid_current()';
+ RETURN true;
+ END; $$;
+));
+
+$node->safe_psql('postgres', q(
+ CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+ LANGUAGE plpgsql AS $$
+ BEGIN
+ RETURN MOD($1, 2) = 0;
+ END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+ $pgbench_options,
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+ {
+ 'concurrent_ops' => sprintf(q(
+ SET debug_parallel_query = off; -- this is because predicate_stable implementation
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set variant random(0, 5)
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ \if :variant = 0
+ CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+ \elif :variant = 1
+ CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+ \elif :variant = 2
+ CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+ \elif :variant = 3
+ CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+ \elif :variant = 4
+ CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+ \elif :variant = 5
+ CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+ \endif
+ \set sleep_ms random(0, %d)
+ \sleep :sleep_ms ms
+ SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY new_idx;
+ \set sleep_ms random(0, %d)
+ \sleep :sleep_ms ms
+ SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY new_idx;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1000, 100000)
+ BEGIN;
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+ ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+ ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+ ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+ ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+ ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+ SELECT setval('in_row_rebuild', 1);
+ COMMIT;
+ \endif
+ ), $max_sleep_ms, $max_sleep_ms)
+ });
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+ $pgbench_options,
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+ {
+ 'concurrent_ops_unique_idx' => sprintf(q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+ \set sleep_ms random(0, %d)
+ \sleep :sleep_ms ms
+ SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+ REINDEX INDEX CONCURRENTLY new_idx;
+ \set sleep_ms random(0, %d)
+ \sleep :sleep_ms ms
+ SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+ DROP INDEX CONCURRENTLY new_idx;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1, power(10, random(1, 5)))
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+ ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+ SELECT setval('in_row_rebuild', 1);
+ \endif
+ ), $max_sleep_ms, $max_sleep_ms)
+ });
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+ $pgbench_options,
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+ {
+ 'concurrent_ops_gin_idx' => sprintf(q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+ \set sleep_ms random(0, %d)
+ \sleep :sleep_ms ms
+ SELECT gin_index_check('new_idx');
+ REINDEX INDEX CONCURRENTLY new_idx;
+ \set sleep_ms random(0, %d)
+ \sleep :sleep_ms ms
+ SELECT gin_index_check('new_idx');
+ DROP INDEX CONCURRENTLY new_idx;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1, power(10, random(1, 5)))
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+ ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+ SELECT setval('in_row_rebuild', 1);
+ \endif
+ ), $max_sleep_ms, $max_sleep_ms)
+ });
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+ $pgbench_options,
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+ {
+ 'concurrent_ops_other_idx' => sprintf(q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ SELECT nextval('in_row_rebuild') AS last_value \gset
+ \set parallels random(0, 4)
+ \if :last_value < 3
+ ALTER TABLE tbl SET (parallel_workers=:parallels);
+ \set variant random(0, 3)
+ \if :variant = 0
+ CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+ \elif :variant = 1
+ CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+ \elif :variant = 2
+ CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+ \elif :variant = 3
+ CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+ \endif
+ \set sleep_ms random(0, %d)
+ \sleep :sleep_ms ms
+ REINDEX INDEX CONCURRENTLY new_idx;
+ \set sleep_ms random(0, %d)
+ \sleep :sleep_ms ms
+ DROP INDEX CONCURRENTLY new_idx;
+ \endif
+ SELECT pg_advisory_unlock(42);
+ \else
+ \set num random(1, power(10, random(1, 5)))
+ INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+ ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+ SELECT setval('in_row_rebuild', 1);
+ \endif
+ ), $max_sleep_ms, $max_sleep_ms)
+ });
+
+$node->stop;
+done_testing();
--
2.43.0
[application/octet-stream] v31-0006-Optimize-auxiliary-index-handling.patch (2.1K, 7-v31-0006-Optimize-auxiliary-index-handling.patch)
download | inline diff:
From bb56f91df5a44c7865e6f599738cdec476497021 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v31 6/7] Optimize auxiliary index handling
Skip unnecessary computations for auxiliary indices by:
- in the index-insert path, detect auxiliary indexes and bypass Datum value computation
- set indexUnchanged=false for auxiliary indices to avoid redundant checks
These optimizations reduce overhead during concurrent index operations.
---
src/backend/catalog/index.c | 11 +++++++++++
src/backend/executor/execIndexing.c | 5 ++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4b6a0f76c81..2d7d25f1986 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2917,6 +2917,17 @@ FormIndexDatum(IndexInfo *indexInfo,
ListCell *indexpr_item;
int i;
+ /* Auxiliary index does not need any values to be computed */
+ if (unlikely(indexInfo->ii_Auxiliary))
+ {
+ for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+ {
+ values[i] = PointerGetDatum(NULL);
+ isnull[i] = true;
+ }
+ return;
+ }
+
if (indexInfo->ii_Expressions != NIL &&
indexInfo->ii_ExpressionsState == NIL)
{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9d071e495c6..ce76a213556 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -438,8 +438,11 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
* There's definitely going to be an index_insert() call for this
* index. If we're being called as part of an UPDATE statement,
* consider if the 'indexUnchanged' = true hint should be passed.
+ *
+ * For auxiliary indexes, always pass false to skip value comparison checks,
+ * since auxiliary indexes only store TIDs and don't track value changes.
*/
- indexUnchanged = ((flags & EIIT_IS_UPDATE) &&
+ indexUnchanged = ((flags & EIIT_IS_UPDATE) && likely(!indexInfo->ii_Auxiliary) &&
index_unchanged_by_update(resultRelInfo,
estate,
indexInfo,
--
2.43.0
[application/octet-stream] v31-0007-Refresh-snapshot-periodically-during-index-valid.patch (22.5K, 8-v31-0007-Refresh-snapshot-periodically-during-index-valid.patch)
download | inline diff:
From 76c15aa5624a9dd861862bd42956cebf042459bc Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v31 7/7] Refresh snapshot periodically during index validation
Enhances validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.
The validation now takes a fresh snapshot every few pages, allowing the xmin horizon to advance. This restores feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. New STIR-based approach does not depend on single reference snapshot anymore.
---
src/backend/access/heap/README.HOT | 4 +-
src/backend/access/heap/heapam_handler.c | 65 +++++++++++++++++++++++-
src/backend/access/spgist/spgvacuum.c | 12 +++--
src/backend/catalog/index.c | 63 ++++++++++++++++-------
src/backend/commands/indexcmds.c | 50 +++---------------
src/include/access/tableam.h | 25 ++++-----
src/include/access/transam.h | 15 ++++++
src/include/catalog/index.h | 2 +-
8 files changed, 153 insertions(+), 83 deletions(-)
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index b1c797517ee..382fe1723a5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open. Then validate
the index. This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if they are visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
the value to be inserted is the one from the live tuple.
Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished. This ensures that nobody is
+the latest used snapshot is finished. This ensures that nobody is
alive any longer who could need to see any tuples that might be missing
from the index, as well as ensuring that no one can see any inconsistent
rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f90310a1ab8..78bc1bff70e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2022,23 +2022,26 @@ heapam_index_validate_scan_read_stream_next(
return result;
}
-static void
+static TransactionId
heapam_index_validate_scan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
- Snapshot snapshot,
ValidateIndexState *state,
ValidateIndexState *auxState)
{
+ TransactionId limitXmin;
+
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ Snapshot snapshot;
TupleTableSlot *slot;
EState *estate;
ExprContext *econtext;
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
int64 num_to_check;
+ BlockNumber page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
Tuplestorestate *tuples_for_check;
ValidateIndexScanState callback_private_data;
@@ -2049,6 +2052,8 @@ heapam_index_validate_scan(Relation heapRelation,
/* Use 10% of memory for tuple store. */
int store_work_mem_part = maintenance_work_mem / 10;
+ PushActiveSnapshot(GetTransactionSnapshot());
+
/*
* Encode TIDs as int8 values for the sort, rather than directly sorting
* item pointers. This can be significantly faster, primarily because TID
@@ -2057,6 +2062,12 @@ heapam_index_validate_scan(Relation heapRelation,
*/
tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
+ PopActiveSnapshot();
+ InvalidateCatalogSnapshot();
+
+ Assert(!HaveRegisteredOrActiveSnapshot());
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
/*
* sanity checks
*/
@@ -2072,6 +2083,29 @@ heapam_index_validate_scan(Relation heapRelation,
state->tuplesort = auxState->tuplesort = NULL;
+ /*
+ * Now take the first snapshot that will be used to filter candidate
+ * tuples. We are going to replace it by newer snapshot every so often
+ * to propagate horizon.
+ *
+ * Beware! There might still be snapshots in use that treat some transaction
+ * as in-progress that our temporary snapshot treats as committed.
+ *
+ * If such a recently-committed transaction deleted tuples in the table,
+ * we will not include them in the index; yet those transactions which
+ * see the deleting one as still-in-progress will expect such tuples to
+ * be there once we mark the index as valid.
+ *
+ * We solve this by waiting for all endangered transactions to exit before
+ * we mark the index as valid, for that reason limitXmin is supported.
+ *
+ * We also set ActiveSnapshot to this snap, since functions in indexes may
+ * need a snapshot.
+ */
+ snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(snapshot);
+ limitXmin = snapshot->xmin;
+
estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2105,6 +2139,7 @@ heapam_index_validate_scan(Relation heapRelation,
LockBuffer(buf, BUFFER_LOCK_SHARE);
block_number = BufferGetBlockNumber(buf);
+ page_read_counter++;
i = 0;
while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2162,6 +2197,20 @@ heapam_index_validate_scan(Relation heapRelation,
}
ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+ if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+ {
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ /* to make sure we propagate xmin */
+ InvalidateCatalogSnapshot();
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+
+ snapshot = RegisterSnapshot(GetLatestSnapshot());
+ PushActiveSnapshot(snapshot);
+ /* xmin should not go backwards, but just in case */
+ limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+ }
}
ExecDropSingleTupleTableSlot(slot);
@@ -2171,11 +2220,23 @@ heapam_index_validate_scan(Relation heapRelation,
read_stream_end(read_stream);
tuplestore_end(tuples_for_check);
+ /*
+ * Drop the latest snapshot. We must do this before waiting out other
+ * snapshot holders, else we will deadlock against other processes also
+ * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+ * they must wait for.
+ */
+ PopActiveSnapshot();
+ UnregisterSnapshot(snapshot);
+ InvalidateCatalogSnapshot();
+ Assert(MyProc->xmin == InvalidTransactionId);
FreeAccessStrategy(bstrategy);
/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;
+
+ return limitXmin;
}
/*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 6b7117b56b2..7ea60c18e6f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
* Add target TID to pending list if the redirection could have
* happened since VACUUM started. (If xid is invalid, assume it
* must have happened before VACUUM started, since REINDEX
- * CONCURRENTLY locks out VACUUM.)
+ * CONCURRENTLY locks out VACUUM, if myXmin is invalid it is
+ * validation scan.)
*
* Note: we could make a tighter test by seeing if the xid is
* "running" according to the active snapshot; but snapmgr.c
* doesn't currently export a suitable API, and it's not entirely
* clear that a tighter test is worth the cycles anyway.
*/
- if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+ if (!TransactionIdIsValid(bds->myXmin) ||
+ TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
spgAddPendingTID(bds, &dt->pointer);
}
else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
/* Finish setting up spgBulkDeleteState */
initSpGistState(&bds->spgstate, index);
bds->pendingList = NULL;
- bds->myXmin = GetActiveSnapshot()->xmin;
bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
bds.stats = stats;
bds.callback = callback;
bds.callback_state = callback_state;
+ if (info->validate_index)
+ bds.myXmin = InvalidTransactionId;
+ else
+ bds.myXmin = GetActiveSnapshot()->xmin;
spgvacuumscan(&bds);
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
bds.stats = stats;
bds.callback = dummy_callback;
bds.callback_state = NULL;
+ bds.myXmin = GetActiveSnapshot()->xmin;
spgvacuumscan(&bds);
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2d7d25f1986..c37a786dafd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -69,6 +69,7 @@
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "storage/proc.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
@@ -3514,8 +3515,9 @@ IndexCheckExclusion(Relation heapRelation,
* insert their new tuples into it. At the same moment we clear "indisready" for
* auxiliary index, since it is no more required to be updated.
*
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
* (Any tuples committed live after the snap will be inserted into the
* index by their originating transaction. Any tuples committed dead before
* the snap need not be indexed, because we will wait out all transactions
@@ -3528,7 +3530,7 @@ IndexCheckExclusion(Relation heapRelation,
* TIDs of both auxiliary and target indexes, and doing a "merge join" against
* the TID lists to see which tuples from auxiliary index are missing from the
* target index. Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
* particular order: auxiliary first, target last.
*
* Building a unique index this way is tricky: we might try to insert a
@@ -3541,21 +3543,24 @@ IndexCheckExclusion(Relation heapRelation,
* before it declares a uniqueness error.
*
* After completing validate_index(), we wait until all transactions that
- * were alive at the time of the reference snapshot are gone; this is
- * necessary to be sure there are none left with a transaction snapshot
- * older than the reference (and hence possibly able to see tuples we did
- * not index). Then we mark the index "indisvalid" and commit. Subsequent
- * transactions will be able to use it for queries.
+ * were alive at the time of the latest snapshot used during validation are
+ * gone; this is necessary to be sure there are none left with a transaction
+ * snapshot older than that (and hence possibly able to see tuples we did
+ * not index). The snapshot is periodically refreshed during the heap scan
+ * to propagate the xmin horizon, so limitXmin tracks the most recent one.
+ * Then we mark the index "indisvalid" and commit. Subsequent transactions
+ * will be able to use it for queries.
*
* Also, some actions to concurrent drop the auxiliary index are performed.
*/
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
{
Relation heapRelation,
indexRelation,
auxIndexRelation;
IndexInfo *indexInfo;
+ TransactionId limitXmin;
IndexVacuumInfo ivinfo, auxivinfo;
ValidateIndexState state, auxState;
Oid save_userid;
@@ -3605,8 +3610,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
* Fetch info needed for index_insert. (You might think this should be
* passed in from DefineIndex, but its copy is long gone due to having
* been built in a previous transaction.)
+ *
+ * We might need snapshot for index expressions or predicates.
*/
+ PushActiveSnapshot(GetTransactionSnapshot());
indexInfo = BuildIndexInfo(indexRelation);
+ PopActiveSnapshot();
/* mark build is concurrent just for consistency */
indexInfo->ii_Concurrent = true;
@@ -3642,6 +3651,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
NULL, TUPLESORT_NONE);
auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+ /* tuplesort_begin_datum may require catalog snapshot */
+ InvalidateCatalogSnapshot();
+
(void) index_bulk_delete(&auxivinfo, NULL,
validate_index_callback, &auxState);
/* If aux index is empty, merge may be skipped */
@@ -3661,7 +3673,13 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
index_close(indexRelation, NoLock);
table_close(heapRelation, NoLock);
- return;
+ PushActiveSnapshot(GetTransactionSnapshot());
+ limitXmin = GetActiveSnapshot()->xmin;
+ PopActiveSnapshot();
+ InvalidateCatalogSnapshot();
+
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ return limitXmin;
}
state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
@@ -3670,6 +3688,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
NULL, TUPLESORT_NONE);
state.htups = state.itups = state.tups_inserted = 0;
+ /* tuplesort_begin_datum may require catalog snapshot */
+ InvalidateCatalogSnapshot();
+
/* ambulkdelete updates progress metrics */
(void) index_bulk_delete(&ivinfo, NULL,
validate_index_callback, &state);
@@ -3689,19 +3710,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
pgstat_progress_update_multi_param(3, progress_index, progress_vals);
}
tuplesort_performsort(state.tuplesort);
+ /* tuplesort_performsort may require catalog snapshot */
+ InvalidateCatalogSnapshot();
+
tuplesort_performsort(auxState.tuplesort);
+ /* tuplesort_performsort may require catalog snapshot */
+ InvalidateCatalogSnapshot();
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/*
* Now merge both indexes
*/
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
- table_index_validate_scan(heapRelation,
- indexRelation,
- indexInfo,
- snapshot,
- &state,
- &auxState);
+ limitXmin = table_index_validate_scan(heapRelation,
+ indexRelation,
+ indexInfo,
+ &state,
+ &auxState);
/* Tuple sort closed by table_index_validate_scan */
Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3724,6 +3750,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
index_close(auxIndexRelation, NoLock);
index_close(indexRelation, NoLock);
table_close(heapRelation, NoLock);
+
+ Assert(!TransactionIdIsValid(MyProc->xmin));
+ return limitXmin;
}
/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b430d4a5b34..0e7b961b170 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -596,7 +596,6 @@ DefineIndex(ParseState *pstate,
LockRelId heaprelid;
LOCKTAG heaplocktag;
LOCKMODE lockmode;
- Snapshot snapshot;
Oid root_save_userid;
int root_save_sec_context;
int root_save_nestlevel;
@@ -1814,32 +1813,11 @@ DefineIndex(ParseState *pstate,
/* Tell concurrent index builds to ignore us, if index qualifies */
if (safe_index)
set_indexsafe_procflags();
-
- /*
- * Now take the "reference snapshot" that will be used by validate_index()
- * to filter candidate tuples. Beware! There might still be snapshots in
- * use that treat some transaction as in-progress that our reference
- * snapshot treats as committed. If such a recently-committed transaction
- * deleted tuples in the table, we will not include them in the index; yet
- * those transactions which see the deleting one as still-in-progress will
- * expect such tuples to be there once we mark the index as valid.
- *
- * We solve this by waiting for all endangered transactions to exit before
- * we mark the index as valid.
- *
- * We also set ActiveSnapshot to this snap, since functions in indexes may
- * need a snapshot.
- */
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
/*
* Merge content of auxiliary and target indexes - insert any missing index entries.
*/
- validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
- limitXmin = snapshot->xmin;
+ limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
- PopActiveSnapshot();
- UnregisterSnapshot(snapshot);
/*
* The snapshot subsystem could still contain registered snapshots that
* are holding back our process's advertised xmin; in particular, if
@@ -1861,8 +1839,8 @@ DefineIndex(ParseState *pstate,
/*
* The index is now valid in the sense that it contains all currently
* interesting tuples. But since it might not contain tuples deleted just
- * before the reference snap was taken, we have to wait out any
- * transactions that might have older snapshots.
+ * before the last snapshot during validating was taken, we have to wait
+ * out any transactions that might have older snapshots.
*/
INJECTION_POINT("define-index-before-set-valid", NULL);
pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4427,7 +4405,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
{
ReindexIndexInfo *newidx = lfirst(lc);
TransactionId limitXmin;
- Snapshot snapshot;
StartTransactionCommand();
@@ -4442,13 +4419,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
if (newidx->safe)
set_indexsafe_procflags();
- /*
- * Take the "reference snapshot" that will be used by validate_index()
- * to filter candidate tuples.
- */
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- PushActiveSnapshot(snapshot);
-
/*
* Update progress for the index to build, with the correct parent
* table involved.
@@ -4460,16 +4430,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
progress_vals[3] = newidx->amId;
pgstat_progress_update_multi_param(4, progress_index, progress_vals);
- validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
- /*
- * We can now do away with our active snapshot, we still need to save
- * the xmin limit to wait for older snapshots.
- */
- limitXmin = snapshot->xmin;
-
- PopActiveSnapshot();
- UnregisterSnapshot(snapshot);
+ limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+ Assert(!TransactionIdIsValid(MyProc->xmin));
/*
* To ensure no deadlocks, we must commit and start yet another
@@ -4482,7 +4444,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
/*
* The index is now valid in the sense that it contains all currently
* interesting tuples. But since it might not contain tuples deleted
- * just before the reference snap was taken, we have to wait out any
+ * just before the latest snap was taken, we have to wait out any
* transactions that might have older snapshots.
*
* Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1a997537800..2380a593d71 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -701,12 +701,11 @@ typedef struct TableAmRoutine
TableScanDesc scan);
/* see table_index_validate_scan for reference about parameters */
- void (*index_validate_scan) (Relation table_rel,
- Relation index_rel,
- IndexInfo *index_info,
- Snapshot snapshot,
- ValidateIndexState *state,
- ValidateIndexState *aux_state);
+ TransactionId (*index_validate_scan) (Relation table_rel,
+ Relation index_rel,
+ IndexInfo *index_info,
+ ValidateIndexState *state,
+ ValidateIndexState *aux_state);
/* ------------------------------------------------------------------------
@@ -1829,20 +1828,18 @@ table_index_build_range_scan(Relation table_rel,
* Note: it is responsibility of that function to close sortstates in
* both `state` and `auxstate`.
*/
-static inline void
+static inline TransactionId
table_index_validate_scan(Relation table_rel,
Relation index_rel,
IndexInfo *index_info,
- Snapshot snapshot,
ValidateIndexState *state,
ValidateIndexState *auxstate)
{
- table_rel->rd_tableam->index_validate_scan(table_rel,
- index_rel,
- index_info,
- snapshot,
- state,
- auxstate);
+ return table_rel->rd_tableam->index_validate_scan(table_rel,
+ index_rel,
+ index_info,
+ state,
+ auxstate);
}
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 6fa91bfcdc0..b33084cb91a 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
return b;
}
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+ if (!TransactionIdIsValid(a))
+ return b;
+
+ if (!TransactionIdIsValid(b))
+ return a;
+
+ if (TransactionIdFollows(a, b))
+ return a;
+ return b;
+}
+
/* return the newer of the two IDs */
static inline FullTransactionId
FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 727993d1a5a..91666663834 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -158,7 +158,7 @@ extern void index_build(Relation heapRelation,
bool isreindex,
bool parallel);
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
--
2.43.0
view thread (65+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
In-Reply-To: <CADzfLwVeQikArGV885zRMiSDW2-y=h=bvQiROvdq1Re1ojx9QA@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox