Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vVVLM-004XHb-1T for pgsql-hackers@arkaria.postgresql.org; Tue, 16 Dec 2025 13:44:02 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vVVLL-006rr1-1K for pgsql-hackers@arkaria.postgresql.org; Tue, 16 Dec 2025 13:44:00 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vVVLK-006rqt-38 for pgsql-hackers@lists.postgresql.org; Tue, 16 Dec 2025 13:43:59 +0000 Received: from lahtoruutu.iki.fi ([185.185.170.37]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vVVLJ-0013Of-0c for pgsql-hackers@postgresql.org; Tue, 16 Dec 2025 13:43:59 +0000 Received: from [10.0.2.15] (unknown [130.41.208.2]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange x25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: hlinnaka) by lahtoruutu.iki.fi (Postfix) with ESMTPSA id 4dVynk4zvGz49Q7b; Tue, 16 Dec 2025 15:43:54 +0200 (EET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=iki.fi; s=lahtoruutu; t=1765892636; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=SejAsxSlJHxiJAYqw6lsJKTE2L4tjL+2mcGtf7WV60w=; b=Q9OhfdNbPqwZ3k7PXCLEF61Dp5sT5V6YNAV4kt44FXTnRf49Y+to6a/l45TkrUe/6lrRiE umwbD59m3JWwpHxHFJtXnysi9xgJ3DTMvtHhnDtAy0IApeRQvqL6Nb8eYSptu+q1aJhbgr XXwqMxxbFxHnkU6RZzB97sV3aEV/01bMYfxx+77K9j34eAvjWpD6ojCaavy9Kbw0Aqqg6R 5tW3C2WODn92YDmBBnst/2RjphLEyiOq+QWC6KvQUZhLyp88TbqLCSJ+1Zfq3H8wSDqOgI 8ECiaXm5iM7n5+QKBENz6bgDAtu7k9VYMF5rIaUEPRflQs2ijM6BP0okazyySQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=iki.fi; s=lahtoruutu; t=1765892636; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=SejAsxSlJHxiJAYqw6lsJKTE2L4tjL+2mcGtf7WV60w=; b=bPOxdngcFLA8IGG6J8qHE+gpoUCTfVWkykz1+hHZu15xFIeGu7pQ45saGCvusVlFfaYGB7 51svo6nXO4DqTQtG6T0Xoy8bfc47wJ9PngrbSaSWQUSx+Y3ZmWbn4cJj+upDj+dSfxw9y7 SO+18rzTiXct6PA1DYUsqKEw2wcfIH5MktHbjR8fGHXBHMYEbeZcu/FqJWAFiFpfhmD0Uj KOxR2pwUfUYUhX/7Ie7GWqTgEJ+sg7/Xdk/OwdcQ8wxJotslphRS5HV6bGvkqlJRyzZll7 bKOWUXAhvhVyQ6P3aWQDqGnpMhMegAXpesPnmJhdKvK/3RLUjb60RoUEAYfXDQ== ARC-Authentication-Results: i=1; ORIGINATING; auth=pass smtp.auth=hlinnaka smtp.mailfrom=hlinnaka@iki.fi ARC-Seal: i=1; a=rsa-sha256; d=iki.fi; s=lahtoruutu; cv=none; t=1765892636; b=vIIXrAN6b4Tlsj6Eb7ohofmdxM3F+unm5i/Yp/y2mtPBq1eyTYsSs6Km2GCd6OIcyPwVp0 BiNl+qB/sbEdqJ84de+WeBZ1WR0uxLjxpscSk97QHs+V6GfN/ReW+jibcbrxWXXE1W1PVi qESdRiDaVlZ9tk+9KaP+W/ApeBiyqvAyhPpg6NkAn+VzwN/mouaGaL4oHZLBaEKM0TolzG LI4du2H4OOFbamp5+PkwG7q5eNbYNaLSfkJIH5Qga8R0CA7SUFDI7kK6RXs63HLuZpSxQ5 48M1lJGdRi0EREpmv6FVO3MBqvZHYMeQt0e/EkHJnGq/JrwyVB53N4OqZzh91A== Message-ID: <4b368c17-a614-401b-a335-398450249c47@iki.fi> Date: Tue, 16 Dec 2025 15:43:53 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements To: Hannu Krosing , PostgreSQL Hackers Cc: Antonin Houska , Matthias van de Meent , Mihail Nikalayeu , Sergey Sargsyan , alvherre@kurilemu.de, Andres Freund , Michael Paquier , Andrey Borodin , Melanie Plageman References: <5784.1764580169@localhost> <5293.1764837294@localhost> <120851.1764875744@localhost> Content-Language: en-US From: Heikki Linnakangas In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Didn't know which part of this thread to quote and reply to, so I'll comment on the whole thing. This is a mix of a summary of the ideas already discussed, and a new proposal. Firstly, I think the STIR approach is the right approach at the high level. I don't like the logical decoding idea, for the reasons Matthias and Mikhail already mentioned. Maybe there's some synergy with REPACK, but it feels different enough that I doubt it. Let's focus on the STIR approach. Summary of CIC as it is today ----------------------------- To recap, the CIC approach at very high level is: 1. Build the index, while backends are modifying the table concurrently 2. Retail insert all the tuples that we missed in step 1. A lot of logic and coordination goes into determining what was missed in step 1. Currently, it involves snapshots, waiting for concurrent transactions to finish, and re-scanning the index and the table. The STIR idea is to maintain a little data structure on the side where we collect items that are inserted between steps 1 and 2, to avoid re-scanning the table. Shmem struct ------------ One high-level observation: We're using the catalog for inter-process communication, with the indisready and indisvalid flags, and now with STIR by having a special, ephemeral index AM. That feels unnecessarily difficult. I propose that we introduce a little shared memory struct to keep track of in-progress CONCURRENTLY index builds. In the first transaction that inserts the catalog entry with indisready=false, also create a shmem struct. In that struct, we can store information about what state the build is in, and whether insertions should go to the STIR or to the real index. Avoid one wait-for-all-transactions step using the shmem struct --------------------------------------------------------------- As one small incremental improvement, we could use the shmem struct to avoid one of the "wait for all transactions" steps in the current implementation. In validate_index(), after we mark the index as 'indisready' we have to wait for all transactions to finish, to ensure that all subsequent insertions have seen the indisready=true change. We could avoid that by setting a flag in the shmem struct instead, so that all backends would see instantly that the flag is flipped. Improved STIR approach ---------------------- Here's another proposal using the STIR approach. It's a little different from the patches so far: - Instead of having an ephemeral index AM, I'm imagining that index_insert() has access to the shmem struct, and knows about the STIR and can redirect insertions to it. - I want to avoid re-scanning the index as well as the heap. To accomplish that, track more precisely which tuples are already in the index and which are not, by storing XID cutoffs in the shmem struct. The proposal: 1. Insert the catalog entry with indisvalid = false and indisready = false. Commit the transaction. 2. Wait for all transactions to finish. - Now we know that all subsequently-started transactions will see the index and will take it into account when deciding HOT chains. (No changes to current implementation so far) - All subsequently-started transactions will now also check the shmem struct for the status of the index build, in index_insert(). We'll use the shmem struct to coordinate the later steps. 3. Atomically do the following: 3.1 Take snapshot A 3.2 Store the snapshot's xmax in the shmem struct where all concurrent backends can see it. Let's call this "cutoff A". After this step, whenever a backend inserts a new tuple, it will append its TID to the STIR if the transaction's XID >= cutoff A. (No insertions to the actual index yet) 4. Build the index using snapshot A. It will include all tuples visible or in-progress according to the snapshot. 5. Atomically do the following: 5.1. Take snapshot B 5.2. Store the snapshot's xmax in the shmem struct. We'll call this cutoff B. From now on, backends insert all tuples >= cutoff B directly to the index. Tuples between A and B continue to be appended to the STIR. 6. Wait for all transactions < B to finish. At this stage: - All tuples < A are in the index. They were included in the bulk ambuild. - All tuples between A and B are in the STIR. - All tuples >= B are inserted to the index by the backends 7. Retail insert all the tuples from the STIR to the index. Snapshot refreshing ------------------- The above proposal doesn't directly accomplish the original goal of advancing the global xmin horizon. You still need two long-lived snapshots. It does however make CIC faster, by eliminating the full index scan and table scan in the validate_index() stage. That already helps a little. I believe it can be extended to also advance xmin horizon: - In step 4, while we are building the index, we can periodically get a new snapshot, update the cutoff in the shmem struct, and drain the STIR of the tuples that are already in it. - In step 7, we can take a new snapshot as often as we like. The snapshot is only used to evaluate expressions. - Heikki