Message-ID: <4b368c17-a614-401b-a335-398450249c47@iki.fi>
Date: Tue, 16 Dec 2025 15:43:53 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
To: Hannu Krosing <hannuk@google.com>,
 PostgreSQL Hackers <pgsql-hackers@postgresql.org>
Cc: Antonin Houska <ah@cybertec.at>,
 Matthias van de Meent <boekewurm+postgres@gmail.com>,
 Mihail Nikalayeu <mihailnikalayeu@gmail.com>,
 Sergey Sargsyan <sergey.sargsyan.2001@gmail.com>, alvherre@kurilemu.de,
 Andres Freund <andres@anarazel.de>, Michael Paquier <michael@paquier.xyz>,
 Andrey Borodin <amborodin86@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>
References: 
 <CADzfLwW9QczZW-E=McxcjUv0e5VMDctQNETbgao0K-SimVhFPA@mail.gmail.com>
 <CADzfLwVYUBb8cUVQ_1mzVzNMyJH84VZKFCRyATvBZKbLW377CA@mail.gmail.com>
 <CADzfLwWbV1i7+cP_Hqr3qgQnBXkAqgrCQxd5PFzqp2AOTK=40w@mail.gmail.com>
 <CADzfLwXJc0jdDDS43-Fj0gKmwX-FURS3eY7MyLQ89qDPA6T5Ug@mail.gmail.com>
 <CADzfLwVaV15R2rUNZmKqLKweiN3SnUBg=6_qGE_ERb7cdQUD8g@mail.gmail.com>
 <CAEze2WgBffcC_SKGLmVxW8uRTEsrwWOHDQujN6zyxy1tSYLJ=Q@mail.gmail.com>
 <CADzfLwVon8ESWOkg+8KU0F9=Hg7QKriNVX-hqcm-v-XZmHkzig@mail.gmail.com>
 <CAEze2WiXYx1LKr=9d7PLsZOYrGytY9AN__tFFw4p_Ysgm1-e5g@mail.gmail.com>
 <CADzfLwUKXcXKZgX+e8ACsOXe_CgtWmNJY_6dyn8EO0AXYOn2pA@mail.gmail.com>
 <CAEze2WiiR2PeXg_vaURjjiiwvjQ=Um8wxWi1BcVS0BGyxiD2gQ@mail.gmail.com>
 <5784.1764580169@localhost>
 <CAEze2Wi_wZqxzZE0g==XuROUUnhpLwZM6=9Snkxbo7-ZH2J2yg@mail.gmail.com>
 <5293.1764837294@localhost>
 <CAEze2WgCuzsFpH_=ySAToT7bAFyj_cb82ZFnMeexnd-NYiVeOw@mail.gmail.com>
 <120851.1764875744@localhost>
 <CAMT0RQRT2wjMtH0iDLkzqAh9=SSo+Oed7jDT3jvsHXOgjY91Gw@mail.gmail.com>
Content-Language: en-US
From: Heikki Linnakangas <hlinnaka@iki.fi>
In-Reply-To: 
 <CAMT0RQRT2wjMtH0iDLkzqAh9=SSo+Oed7jDT3jvsHXOgjY91Gw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Archived-At: 
 <https://www.postgresql.org/message-id/4b368c17-a614-401b-a335-398450249c47%40iki.fi>
Precedence: bulk

Didn't know which part of this thread to quote and reply to, so I'll 
comment on the whole thing. This is a mix of a summary of the ideas 
already discussed, and a new proposal.

Firstly, I think the STIR approach is the right approach at the high 
level. I don't like the logical decoding idea, for the reasons Matthias 
and Mikhail already mentioned. Maybe there's some synergy with REPACK, 
but it feels different enough that I doubt it. Let's focus on the STIR 
approach.

Summary of CIC as it is today
-----------------------------

To recap, the CIC approach at very high level is:

1. Build the index, while backends are modifying the table concurrently

2. Retail insert all the tuples that we missed in step 1.

A lot of logic and coordination goes into determining what was missed in 
step 1. Currently, it involves snapshots, waiting for concurrent 
transactions to finish, and re-scanning the index and the table.

The STIR idea is to maintain a little data structure on the side where 
we collect items that are inserted between steps 1 and 2, to avoid 
re-scanning the table.


Shmem struct
------------

One high-level observation:

We're using the catalog for inter-process communication, with the 
indisready and indisvalid flags, and now with STIR by having a special, 
ephemeral index AM. That feels unnecessarily difficult. I propose that 
we introduce a little shared memory struct to keep track of in-progress 
CONCURRENTLY index builds.

In the first transaction that inserts the catalog entry with 
indisready=false, also create a shmem struct. In that struct, we can 
store information about what state the build is in, and whether 
insertions should go to the STIR or to the real index.

Avoid one wait-for-all-transactions step using the shmem struct
---------------------------------------------------------------

As one small incremental improvement, we could use the shmem struct to 
avoid one of the "wait for all transactions" steps in the current 
implementation. In validate_index(), after we mark the index as 
'indisready' we have to wait for all transactions to finish, to ensure 
that all subsequent insertions have seen the indisready=true change. We 
could avoid that by setting a flag in the shmem struct instead, so that 
all backends would see instantly that the flag is flipped.


Improved STIR approach
----------------------

Here's another proposal using the STIR approach. It's a little different 
from the patches so far:

- Instead of having an ephemeral index AM, I'm imagining that 
index_insert() has access to the shmem struct, and knows about the STIR 
and can redirect insertions to it.

- I want to avoid re-scanning the index as well as the heap. To 
accomplish that, track more precisely which tuples are already in the 
index and which are not, by storing XID cutoffs in the shmem struct.


The proposal:

1. Insert the catalog entry with indisvalid = false and indisready = 
false. Commit the transaction.

2. Wait for all transactions to finish.
   - Now we know that all subsequently-started transactions will see the 
index and will take it into account when deciding HOT chains. (No 
changes to current implementation so far)
   - All subsequently-started transactions will now also check the shmem 
struct for the status of the index build, in index_insert(). We'll use 
the shmem struct to coordinate the later steps.

3. Atomically do the following:
3.1 Take snapshot A
3.2 Store the snapshot's xmax in the shmem struct where all concurrent 
backends can see it. Let's call this "cutoff A".

After this step, whenever a backend inserts a new tuple, it will append 
its TID to the STIR if the transaction's XID >= cutoff A. (No insertions 
to the actual index yet)

4. Build the index using snapshot A. It will include all tuples visible 
or in-progress according to the snapshot.

5. Atomically do the following:
5.1. Take snapshot B
5.2. Store the snapshot's xmax in the shmem struct. We'll call this 
cutoff B.

 From now on, backends insert all tuples >= cutoff B directly to the 
index. Tuples between A and B continue to be appended to the STIR.

6. Wait for all transactions < B to finish.

At this stage:
- All tuples < A are in the index. They were included in the bulk ambuild.
- All tuples between A and B are in the STIR.
- All tuples >= B are inserted to the index by the backends


7. Retail insert all the tuples from the STIR to the index.


Snapshot refreshing
-------------------

The above proposal doesn't directly accomplish the original goal of 
advancing the global xmin horizon. You still need two long-lived 
snapshots. It does however make CIC faster, by eliminating the full 
index scan and table scan in the validate_index() stage. That already 
helps a little.

I believe it can be extended to also advance xmin horizon:

- In step 4, while we are building the index, we can periodically get a 
new snapshot, update the cutoff in the shmem struct, and drain the STIR 
of the tuples that are already in it.

- In step 7, we can take a new snapshot as often as we like. The 
snapshot is only used to evaluate expressions.


- Heikki