Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vOfIG-000Jy1-25 for pgsql-hackers@arkaria.postgresql.org; Thu, 27 Nov 2025 16:56:32 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vOfIF-007U2x-0z for pgsql-hackers@arkaria.postgresql.org; Thu, 27 Nov 2025 16:56:31 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vOfIE-007U2n-2p for pgsql-hackers@lists.postgresql.org; Thu, 27 Nov 2025 16:56:31 +0000 Received: from mail-wr1-x435.google.com ([2a00:1450:4864:20::435]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vOfIC-001mFl-0d for pgsql-hackers@postgresql.org; Thu, 27 Nov 2025 16:56:30 +0000 Received: by mail-wr1-x435.google.com with SMTP id ffacd0b85a97d-42bb288c1bfso644703f8f.2 for ; Thu, 27 Nov 2025 08:56:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cybertec.at; s=google; t=1764262586; x=1764867386; darn=postgresql.org; h=message-id:date:content-id:mime-version:comments:references :in-reply-to:subject:cc:to:from:from:to:cc:subject:date:message-id :reply-to; bh=n3lcRF8aUUro9mUn3T7v99V8a131aaGlp7Ee0my9gZY=; b=kCSxtmSgoyWrWcB+PnEG601/Y2wAoq4FRAGhcYfsKz2avQohOmZAtLDaVsYaX//xZS KWoUUWvklzwyhnpxffOdCi0+wl6bsY54UYJx/ZFBnf6Wc9vcWCNquNtcItROwMOcix3d 1VrMqpErrgs3nWaiPm7PnlISKW1sfTA3u0VQP8JwnI4r0cOPFRjedM/pjDndFtEd5+3c yE1nneUaQoRJ3SNgp6KCPkUjF4GHX4PRtoR288H9Nyx4GWXhtAW5BZjhpYTTypixTIty aDXVkUOQxQjb/YsvecDargAUvWr6fH6Yq7K77CfiRU8mXwtT3vCVdqtOcpzUSEO4sD+P y0wA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764262586; x=1764867386; h=message-id:date:content-id:mime-version:comments:references :in-reply-to:subject:cc:to:from:x-gm-gg:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=n3lcRF8aUUro9mUn3T7v99V8a131aaGlp7Ee0my9gZY=; b=Z3nMTlQ9B6Wgqzf9VaJrtMGzqIxXyuP/H8YawKYd4C3KOSck3xKlrlDmz80DVL8ynD 4EJ0OyoDhlK+2mM4ju0sUNl8q/VC/Be4/hRJdV/QvlR+Atflh03CGO1taEKfdI1sPqiM eFW6NwFzCDfaSCVwDMuHZ0fPKoxaOo+YfLlSPbCW1uDwQZrqimSmi9yWcYkV3mHIYPbf RGWn7gRdpSVcVIcOZoV25iyoYZETDQHWeFH/oiknHRaCkVMlGQTfe6B3zR5l5oc9uTUa 41tU0cjt14teKCaTVwZ9O8Z3KpgFWbPDMnBWMW2AMDPSys46VedG5yd+n8GvHQsVsKZd 5b0w== X-Gm-Message-State: AOJu0YzwM6AGjYQZAIsQSQVvSrZacui0ObhKHowNhyAsJ1gudoyuhyHu LNF4eBwnHrZg/D3NIeVIo7shzoxdiORAthARz3Axs0dfeht3lQzRQICy4KWNnER/WLqtPkkvUtS KnYjd X-Gm-Gg: ASbGncvIKUBt2nsCDu/X9QXbgQWjN+D9mdnRbXBav912BeYVQOgG+cw6RwlXozcMq+C S6n1k6fcxVkcJllPlw1GnbuamE8P4vEPgHEazB7hNECyvgz0WR+4nPFy84ZiWQllW+Hs9XDLDWt g+++PU1v6GRZ0Urh3xOWsD5RCcWC/3C+/h6qnO40XKX7aKtrwfAQfiBOSP/yIlyt/R41VTdZTAF r00EzdfgSGzN6fDPUnFbwJFQ568UfwXxvv/eJO+OlSw561GSNRs91uZTY+VHXUC270YQt1uJ3iX yO9XeZNLBcmu8owd3hH7K/QOuih3LU7nlf8IppsqNYPO3q0vzjp2mYzn/giHU7pUSjvRfT5H+w7 ZHJdhoWhoAl8Yvh85binCYpuM8hAzUkub711ZqXRS2dypDIphakciVpDsl22uJ2MhVjRNrpU9tR g6w6Lfru5qUykRSuANXa2PKmmt X-Google-Smtp-Source: AGHT+IHFYk3Hv3QTA1843hAEbizvEphYcrEGTpROI1MR+IYQPJNnP4FWslCiPnO0nV4nytfa4OyKBA== X-Received: by 2002:a05:6000:2307:b0:42b:3bc4:16dc with SMTP id ffacd0b85a97d-42e0f213b9emr11658620f8f.21.1764262586017; Thu, 27 Nov 2025 08:56:26 -0800 (PST) Received: from localhost (109-81-168-246.rct.o2.cz. [109.81.168.246]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-42e1c5d614asm4480435f8f.12.2025.11.27.08.56.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Nov 2025 08:56:25 -0800 (PST) From: Antonin Houska To: Michail Nikolaev cc: PostgreSQL Hackers , Alvaro Herrera Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements In-reply-to: References: Comments: In-reply-to Michail Nikolaev message dated "Fri, 15 Dec 2023 20:07:29 +0100." X-Mailer: MH-E 8.6+git; nmh 1.8; GNU Emacs 28.3 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: <17482.1764262585.1@localhost> Date: Thu, 27 Nov 2025 17:56:25 +0100 Message-ID: <17483.1764262585@localhost> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Michail Nikolaev wrote: > I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY > improvements) in some lighter way. I haven't read the whole thread yet, but the effort to minimize the impact of C/RIC on VACUUM seems to prevail. Following is one more proposal. The core idea is that C/RIC should avoid indexing dead tuples, however snapshot is not necessary to distinguish dead tuple from a live one. And w/o snapshot, the backend executing C/RIC does not restrict VACUUM on other tables. Concurrent (re)build of unique index appears to be another topic of this thread, but I think this approach should handle the problem too. The workflow is: 1. Create an empty index. 2. Wait until all transactions are aware of the index, so they take the new index into account when deciding on new HOT chains. (This is already implemented.) 3. Set the 'indisready' flag so the index is ready for insertions. 4. While other transactions can insert their tuples into the index now, process the table one page at a time this way: 4.1 Acquire (shared) content lock on the buffer. 4.3 Collect the root tuples of HOT chains - these and only these need to be inserted into the index. 4.4 Unlock the buffer. 5. Once the whole table is processed, insert the collected tuples into the index. To avoid insertions of tuples that concurrent transactions have just inserted, we'd need something like index.c:validate_index() (i.e. insert into the index only the tuples that it does not contain yet), but w/o snapshot because we already have the heap tuples collected. Also it'd make sense to wait for completion of all the transactions that currently have the table locked for INSERT/UPDATE: some of these might have inserted their tuples into the heap, but not yet into the index. If we included some of those tuples into our collection and insert them into the index first, the other transactions could end up with ERROR when inserting those tuples again. 6. Set the 'indisvalid' flag so that the index can be used by queries. Note on pruning: As we only deal with the root tuples of HOT chains (4.3), page pruning triggered by queries (heap_page_prune_opt) should not be disruptive. Actually C/RIC can do the pruning itself it it appears to be useful. For example, if whole HOT chain should be considered DEAD by the next VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we do not insert TID of the root tuple into the index unnecessarily. I can even think of letting VACUUM run on the same table that C/RIC is processing. In that case, interlocking would take place at page level: either C/RIC or VACUUM can acquire lock for particular page, but not both. This would be useful in cases C/RIC takes very long time. In this case, C/RIC *must not* insert TIDs of dead tuples into the index at all. Otherwise there could be race conditions such that VACUUM removes dead tuples from the index and marks the corresponding heap items as UNUSED, but C/RIC then re-inserts the index tuples. To avoid this problem, C/RIC needs to re-check each TID before it inserts it into the index and skip the insertion if the tuple (or the whole HOT-chain starting at this tuple) it points to is DEAD according to the OldestXmin that the most recent VACUUM used. (VACUUM could perhaps advertise its OldestXmin for C/RIC via shared memory.) Also, before this re-checking starts, it must be ensured that VACUUM does not start again, until the index creation is complete: a new run of VACUUM implies a new value of OldestXmin, i.e. need for more stringent re-checking of the heap tuples. Related question is which OldestXmin to use in the step 4.3. One option is to use *exactly* the OldestXmin shared VACUUM. However that wouldn't work if VACUUM starts while C/RIC is already in progress. (Which seems like a significant restriction.) Another option is to get the OldestXmin in the same way as VACUUM does. However, the value can thus be different from the one used by VACUUM: older if retrieved before VACUUM started and newer if retrieved while VACUUM was already running. The first case can be handled by the heap tuple re-checking (see above). The latter implies that, before setting 'indisvalid', C/RIC has to wait until all snapshots have their xmin >= this (more recent) OldestXmin. Otherwise some snapshots could miss data they should see. (An implication of the rule that C/RIC must not insert TIDs of dead tuples into the index is that VACUUM does not have to call the index AM bulk delete while C/RIC is running for that index. This would be just an optimization.) Of course, I could have missed some important point, so please explain why this concept is broken :-) Or let me know if something needs to be explained more in detail. Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com