Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vOuPy-00CJil-34 for pgsql-hackers@arkaria.postgresql.org; Fri, 28 Nov 2025 09:05:31 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vOuPx-00AA0g-1P for pgsql-hackers@arkaria.postgresql.org; Fri, 28 Nov 2025 09:05:29 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vOuPw-00AA0Y-39 for pgsql-hackers@lists.postgresql.org; Fri, 28 Nov 2025 09:05:29 +0000 Received: from mail-wr1-x435.google.com ([2a00:1450:4864:20::435]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vOuPu-001t22-28 for pgsql-hackers@postgresql.org; Fri, 28 Nov 2025 09:05:28 +0000 Received: by mail-wr1-x435.google.com with SMTP id ffacd0b85a97d-42bb288c1bfso940190f8f.2 for ; Fri, 28 Nov 2025 01:05:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cybertec.at; s=google; t=1764320726; x=1764925526; darn=postgresql.org; h=message-id:date:content-transfer-encoding:content-id:mime-version :comments:references:in-reply-to:subject:cc:to:from:from:to:cc :subject:date:message-id:reply-to; bh=npqxzp1MDODjXlpYzBKIi7e7MhPNWdHvqe1IKPASqps=; b=S6lzE3phTJoFdFmNF5kp1tliX/QAE2yihUrOdNS7GUP8XuOpYPXnbUNR5BpYb5bGTf 6pm36mxxGyXxwzEt9pEwXSEc0F4asKlcGc2+rqvlotW2Yp6Sef007EqDyV7DeSdmjGg3 pXM3yy1QbZRoCCos2g1zo2gbc8B/AImtAdaPJ+hUR2wiV2pjtNXqVO3wQNtszupByhlq UN8/V3FC3M0ReLnmaMD+G4tC7HRUCO14ybsb3LLKllpAAWH939OoDTSSJ3NBovxtXuqi qLJ4tsNz6OjF5r8jDsWYK4t1Jjme2Le20vLIsmwqSS++HXDSa/OydWtpPth4o7qxocYY 97uw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764320726; x=1764925526; h=message-id:date:content-transfer-encoding:content-id:mime-version :comments:references:in-reply-to:subject:cc:to:from:x-gm-gg :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=npqxzp1MDODjXlpYzBKIi7e7MhPNWdHvqe1IKPASqps=; b=Xwbrs0lyyn4aOTj+8fG1sGl1yr9Qw1ydNJV8+uDYIbDwjhsUJwWjwwxy/XHqFrkT+i yj/0EYFNy8gf+lPggLEMLYbv+RlsK43WJ1G1qw9ab/NvUxOrd/W1J49pa8M8I0sICes3 lcsEgM0Yu6sXduBKlf+zUN4WCOFTlgkcopUzCEVrkNwBPM8CUZpfj4S/7kG8HqdS6gck 4BcpET656QAwLWi5Vxi6F5Mh+5+bedf2lpf/TU9sZz8/6zbYODhxsHEKdRyw+0W71ADu slcNAwajJzJd29CQKUQQkT67MJPAKktDgkdccIqbynjExvUPCe3vjCujcOJzS8swvwFA 9Wtw== X-Forwarded-Encrypted: i=1; AJvYcCUv8pBlrQTmXDzDdkaLJRHeybzX8+Q/gCEnTkctjadyGU+had4VyXcj7Ql8In1ikYAPqgb1PxCwAIsmzEKr@postgresql.org X-Gm-Message-State: AOJu0Ywsc+c7G4fTst1C95Td38Cdg83PQLarZFrmVhygkuSh3Qs8RrIm bdMIlNsOwAaCWdgffZ/3Iz0yw85THGDNzkAs2Wi7UjA9bqR7OIFWGbjQ3rsv5w3+DBo= X-Gm-Gg: ASbGncvU78hU8xodeTAdmTnfeFHJxO9sQj33tOHlRTWLgAKyNI/Hb9+PBaZDBXuh8EU zSRHHiD9hxGDtOlX6MY7bhFPkFOj0vvws6wpcJsNRrNxFDXGvFSKbQyBdFd0UyuGuUVjfBReI6f /ua3sUf4Uh+iwKgzJVb2Xn2bhPsiFyeSE0yqsOj4RqofWDxrL7MXZk+XJGF1zTC6wEFCzTeK4Ja XnwYjayEd9QyvYHJB63ypz5+5ZH49HCAjVbApll3VRPypBUa1Z/ggOiip0+giFYbg4IAmJnAr0V qBHOWBgqYIxU2YkRnxHSER7MPmCp0I3HC3Y6Xs1jfJMsXlqL/QyZHrMSePhrFjFdXTQEDxBbfZ4 JKlYGnxrZ0poddUFfh5zHGU95Df+9wCtDzsRruHMZRJPMJvdzj9PH7cv2csr3PnDBRJuhuZ7pvc FJdWIiJAQ/zzC0/qasGm1hPIrn X-Google-Smtp-Source: AGHT+IHNPEyVHHLW9kQOkTcj6zikhKfZyF6fbiATj0sTDGP7d9Cu9RuOuKNQprdxhM9M0NPDJVyXPA== X-Received: by 2002:a05:6000:1446:b0:42b:4194:48cf with SMTP id ffacd0b85a97d-42e0f35a9b4mr14872764f8f.52.1764320725653; Fri, 28 Nov 2025 01:05:25 -0800 (PST) Received: from localhost (109-81-168-246.rct.o2.cz. [109.81.168.246]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-42e1ca9aea3sm8239384f8f.35.2025.11.28.01.05.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 28 Nov 2025 01:05:24 -0800 (PST) From: Antonin Houska To: Matthias van de Meent cc: Michail Nikolaev , PostgreSQL Hackers , Alvaro Herrera Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements In-reply-to: References: <17483.1764262585@localhost> Comments: In-reply-to Matthias van de Meent message dated "Thu, 27 Nov 2025 19:22:46 +0100." X-Mailer: MH-E 8.6+git; nmh 1.8; GNU Emacs 28.3 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: <8836.1764320724.1@localhost> Content-Transfer-Encoding: quoted-printable Date: Fri, 28 Nov 2025 10:05:24 +0100 Message-ID: <8837.1764320724@localhost> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Matthias van de Meent wrote: > On Thu, 27 Nov 2025 at 17:56, Antonin Houska wrote: > > > > Michail Nikolaev wrote: > > > > > I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY > > > improvements) in some lighter way. > > > > I haven't read the whole thread yet, but the effort to minimize the im= pact of > > C/RIC on VACUUM seems to prevail. Following is one more proposal. The = core > > idea is that C/RIC should avoid indexing dead tuples, however snapshot= is not > > necessary to distinguish dead tuple from a live one. And w/o snapshot,= the > > backend executing C/RIC does not restrict VACUUM on other tables. > > > > Concurrent (re)build of unique index appears to be another topic of th= is > > thread, but I think this approach should handle the problem too. The w= orkflow > > is: > > > > 1. Create an empty index. > > > > 2. Wait until all transactions are aware of the index, so they take th= e new > > index into account when deciding on new HOT chains. (This is alread= y > > implemented.) > > > > 3. Set the 'indisready' flag so the index is ready for insertions. > > > > 4. While other transactions can insert their tuples into the index now= , > > process the table one page at a time this way: > > > > 4.1 Acquire (shared) content lock on the buffer. > > > > 4.3 Collect the root tuples of HOT chains - these and only these ne= ed to be > > inserted into the index. > > > > 4.4 Unlock the buffer. > = > = > > 5. Once the whole table is processed, insert the collected tuples into= the > > index. > > > > To avoid insertions of tuples that concurrent transactions have jus= t > > inserted, we'd need something like index.c:validate_index() (i.e. i= nsert > > into the index only the tuples that it does not contain yet), but w= /o > > snapshot because we already have the heap tuples collected. > > > > Also it'd make sense to wait for completion of all the transactions= that > > currently have the table locked for INSERT/UPDATE: some of these mi= ght have > > inserted their tuples into the heap, but not yet into the index. If= we > > included some of those tuples into our collection and insert them i= nto the > > index first, the other transactions could end up with ERROR when in= serting > > those tuples again. > > > > 6. Set the 'indisvalid' flag so that the index can be used by queries. > > > > Note on pruning: As we only deal with the root tuples of HOT chains (4= .3), > > page pruning triggered by queries (heap_page_prune_opt) should not be > > disruptive. Actually C/RIC can do the pruning itself it it appears to = be > > useful. For example, if whole HOT chain should be considered DEAD by t= he next > > VACUUM, pruning is likely (depending on the OldestXid) to remove it so= that we > > do not insert TID of the root tuple into the index unnecessarily. > [...] > > Of course, I could have missed some important point, so please explain= why > > this concept is broken :-) Or let me know if something needs to be exp= lained > > more in detail. Thanks. > = > 1. When do you select and insert tuples that aren't part of a hot > chain into the index, i.e. tuples that were never updated after they > got inserted into the table? Or is every tuple "part of a hot chain" > even if the tuple wasn't ever updated? Right, I considered "standalone tuple" a HOT chain of length 1. So it'll b= e picked too. > 2. HOT chains can be created while the index wasn't yet present, and > thus the indexed attributes of the root tuples can be different from > the most current tuple of a chain. If you only gather root tuples, we > could index incorrect data for that HOT chain. The correct approach > here is to index only the visible tuples, as those won't have been > updated in a non-HOT manner without all indexed attributes being > unchanged. Good point. > 3. Having the index marked indisready before it contains any data is > going to slow down the indexing process significantly: > a. The main index build now must go through shared memory and buffer > locking, instead of being able to use backend-local memory > b. The tuple-wise insertion path (IndexAmRoutine->aminsert) can have a > significantly higher overhead than the bulk insertion logic in > ambuild(); in metrics of WAL, pages accessed (IO), and CPU cycles > spent. > = > So, I don't think moving away from ambuild() as basis for initially > building the index this is such a great idea. > = > (However, I do think that having an _option_ to build the index using > ambuildempty()+aminsert() instead of ambuild() might be useful, if > only to more easily compare "natural grown" indexes vs freshly built > ones, but that's completely orthogonal to CIC snapshotting > improvements.) The retail insertions are not something this proposal depends on. I think = it'd be possible to build a separate index locally and then "merge" it with the regular one. I just tried to propose a solution that does not need snapsho= ts. -- = Antonin Houska Web: https://www.cybertec-postgresql.com