Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vOge4-001J1a-0n for pgsql-hackers@arkaria.postgresql.org; Thu, 27 Nov 2025 18:23:08 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vOge2-007qCR-1m for pgsql-hackers@arkaria.postgresql.org; Thu, 27 Nov 2025 18:23:06 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vOge2-007qCJ-0e for pgsql-hackers@lists.postgresql.org; Thu, 27 Nov 2025 18:23:06 +0000 Received: from mail-lj1-x22a.google.com ([2a00:1450:4864:20::22a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vOge0-001pNw-11 for pgsql-hackers@postgresql.org; Thu, 27 Nov 2025 18:23:06 +0000 Received: by mail-lj1-x22a.google.com with SMTP id 38308e7fff4ca-37b935df7bfso10902191fa.2 for ; Thu, 27 Nov 2025 10:23:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764267778; x=1764872578; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=hle8F08ua+c01tKJ6kWFGJiIZrozYS9fjzP5KlyY0xU=; b=ncSNOoWWblSAPKXJoPXR/jZ7OA0SiVRuusa3dUcLONRdlhTyQVcwS2y8K1RDwGEvZX FU+O6kuINLvwmICHEtDBOfyna0u15TKpPcCcagLR1MtYB7zQDZYqgO3br42E+DlVjREq yJn4f5r7LEwAiznvarQ8yD/i6RaFCw0iX1oapDbQsvbxW4aA+Bz4I1R2xMLc7vpQ1Cpk eHhujI3kfZGrcgTOOhfJGdKX3I+Q2kEAiZnasU/KgNg67aOqrLCdmoujrK4JWFx95iev 1TwUSY+AifLFKqTBVZRUPpp92r+DIry8vyY6mYB9tnSJpfPCuQ2XJSgdGKG2aW85dAhP BTvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764267778; x=1764872578; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=hle8F08ua+c01tKJ6kWFGJiIZrozYS9fjzP5KlyY0xU=; b=j8VYaxBFTUt0pW/ohmt01AXAUhVsHtVubs7eDu+vwn7//gnFFM+9121biETDzdL6QR 0xCQYfXWDaGvOnh/saO4fal+w+sg7L99zUtHAUd4/o7CmRiSRtiwUc95HnazcqzYVW5r f1k4Px2Uz6tFwZVhKdmVuDifonjMvcusFPXOcPX6WNIVggRgMvFJ7FNCIXeCeo2NTBIV Y4kwYy6fGLG4ropR9Mausk9BC9N1YS6lfJGDf7JC/6mEZbAKFcYD0tXyydzDG9Jd9+wu vNbzaTGFmBTwMxT79saxtLdlGdFi+pV4DzFihsYMymg6pLUFGJzlP8OECEpgHh1qsOoP qgeg== X-Forwarded-Encrypted: i=1; AJvYcCWKjmPnDC9uuVsyVTz97yCwbV3Op0wHooLLLM59hJXv/LxOvAP5/gs1wMKFQLfncG+llaxzKX5IDlRhb7B/@postgresql.org X-Gm-Message-State: AOJu0YwPAosKehnxXPMgYSYyc4MGs3gfRKGLgt4PU+ts6YG3mhRjBNd4 sr72LOXA9UA04pEtcUmoEgvl44qFJDVjz7yvAaXW+bMMKpIZRh29gAw2pJrzDhmz8XCH7NUBxhh 4FRKzhBH0TkdpbLDf0/y2ho0HnenCHnw= X-Gm-Gg: ASbGncv9Ui9O3kELn+aEybPr35Hq0QK4AGNRvmTIgyKUbOXAysVA4IARSeoUF40UuOH DvJEODWENMygockRigDUZ27nGSEA2pMarEE9B44NJ1MTH1sHktIxPJY+cNipiaAjwLRYYlZ+qg3 dhDffkD9E9vsj0CLas7kR1W3nxLof94brCwCopZA3NNfNJmNZj912XzGNyw+eRdZaOsy1O8bzjT mw+BrW3g2xXn/xguuMpTcKqG7y/Rmlqe43AP7VY2i24nf26TTPiUPfA6RwC92uxpQtinaAQ5U1E PNcc1BFkf9OU/yEAGVox/pXa+TeYmvs0biNanKZUm708qWLryV0fiWbdlkPj+eDvsW/J X-Google-Smtp-Source: AGHT+IFDkOiWbeat6sfX2B49pRC00l1VGUgzEE1qaB5iNjGXjKthuB3DMekcI0hA7AXKnDiI+a4qf+XeKh2Si+Ha5Z0= X-Received: by 2002:a2e:8606:0:b0:37a:d89:392e with SMTP id 38308e7fff4ca-37cd9187a00mr54246981fa.7.1764267777932; Thu, 27 Nov 2025 10:22:57 -0800 (PST) MIME-Version: 1.0 References: <17483.1764262585@localhost> In-Reply-To: <17483.1764262585@localhost> From: Matthias van de Meent Date: Thu, 27 Nov 2025 19:22:46 +0100 X-Gm-Features: AWmQ_bntf5p-pxFl1-59v-luMCZi_SwHY3eERZaC_orQ13Ym7it4SWliRHhmLs0 Message-ID: Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements To: Antonin Houska Cc: Michail Nikolaev , PostgreSQL Hackers , Alvaro Herrera Content-Type: text/plain; charset="UTF-8" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, 27 Nov 2025 at 17:56, Antonin Houska wrote: > > Michail Nikolaev wrote: > > > I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY > > improvements) in some lighter way. > > I haven't read the whole thread yet, but the effort to minimize the impact of > C/RIC on VACUUM seems to prevail. Following is one more proposal. The core > idea is that C/RIC should avoid indexing dead tuples, however snapshot is not > necessary to distinguish dead tuple from a live one. And w/o snapshot, the > backend executing C/RIC does not restrict VACUUM on other tables. > > Concurrent (re)build of unique index appears to be another topic of this > thread, but I think this approach should handle the problem too. The workflow > is: > > 1. Create an empty index. > > 2. Wait until all transactions are aware of the index, so they take the new > index into account when deciding on new HOT chains. (This is already > implemented.) > > 3. Set the 'indisready' flag so the index is ready for insertions. > > 4. While other transactions can insert their tuples into the index now, > process the table one page at a time this way: > > 4.1 Acquire (shared) content lock on the buffer. > > 4.3 Collect the root tuples of HOT chains - these and only these need to be > inserted into the index. > > 4.4 Unlock the buffer. > 5. Once the whole table is processed, insert the collected tuples into the > index. > > To avoid insertions of tuples that concurrent transactions have just > inserted, we'd need something like index.c:validate_index() (i.e. insert > into the index only the tuples that it does not contain yet), but w/o > snapshot because we already have the heap tuples collected. > > Also it'd make sense to wait for completion of all the transactions that > currently have the table locked for INSERT/UPDATE: some of these might have > inserted their tuples into the heap, but not yet into the index. If we > included some of those tuples into our collection and insert them into the > index first, the other transactions could end up with ERROR when inserting > those tuples again. > > 6. Set the 'indisvalid' flag so that the index can be used by queries. > > Note on pruning: As we only deal with the root tuples of HOT chains (4.3), > page pruning triggered by queries (heap_page_prune_opt) should not be > disruptive. Actually C/RIC can do the pruning itself it it appears to be > useful. For example, if whole HOT chain should be considered DEAD by the next > VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we > do not insert TID of the root tuple into the index unnecessarily. [...] > Of course, I could have missed some important point, so please explain why > this concept is broken :-) Or let me know if something needs to be explained > more in detail. Thanks. 1. When do you select and insert tuples that aren't part of a hot chain into the index, i.e. tuples that were never updated after they got inserted into the table? Or is every tuple "part of a hot chain" even if the tuple wasn't ever updated? 2. HOT chains can be created while the index wasn't yet present, and thus the indexed attributes of the root tuples can be different from the most current tuple of a chain. If you only gather root tuples, we could index incorrect data for that HOT chain. The correct approach here is to index only the visible tuples, as those won't have been updated in a non-HOT manner without all indexed attributes being unchanged. 3. Having the index marked indisready before it contains any data is going to slow down the indexing process significantly: a. The main index build now must go through shared memory and buffer locking, instead of being able to use backend-local memory b. The tuple-wise insertion path (IndexAmRoutine->aminsert) can have a significantly higher overhead than the bulk insertion logic in ambuild(); in metrics of WAL, pages accessed (IO), and CPU cycles spent. So, I don't think moving away from ambuild() as basis for initially building the index this is such a great idea. (However, I do think that having an _option_ to build the index using ambuildempty()+aminsert() instead of ambuild() might be useful, if only to more easily compare "natural grown" indexes vs freshly built ones, but that's completely orthogonal to CIC snapshotting improvements.) Kind regards, Matthias van de Meent Databricks (https://www.databricks.com)