Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rbSYT-00FVSK-TJ for pgsql-hackers@arkaria.postgresql.org; Sat, 17 Feb 2024 21:49:06 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rbSYP-008Hsx-Nw for pgsql-hackers@arkaria.postgresql.org; Sat, 17 Feb 2024 21:49:01 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rbSYP-008Hrp-Ac for pgsql-hackers@lists.postgresql.org; Sat, 17 Feb 2024 21:49:01 +0000 Received: from mail-lj1-x233.google.com ([2a00:1450:4864:20::233]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rbSYM-007GkB-Is for pgsql-hackers@postgresql.org; Sat, 17 Feb 2024 21:49:00 +0000 Received: by mail-lj1-x233.google.com with SMTP id 38308e7fff4ca-2d2305589a2so5006011fa.1 for ; Sat, 17 Feb 2024 13:48:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1708206537; x=1708811337; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=XwPBh5qg3OPyT7hp9K78w2DBfOohcZyAqkd2y+Vx9Ag=; b=iT+BxYkDFsfvmb1hjDL6EKssDJYEhZ7T75T01y8t8zjNBVemkz/aJ5I2QEWinObI4H 2500xHm7ccTGVp2YqKOSxqE4gniwWCLc0rLgj7gyQhgQN5ChPG4VrwPPFThsQVPb/kLx u/6sNaxdYeIjO7xyDCIlaIdGI4A4YRhf5//cr69R5EfsFtuWCA+e+JLya6iVwa5678xv Z2gmcpJhZxJ10mtQBO0Q31zXPV0ZKgnE/t+Xhqw3ETFgdFtRjkAQlFYkTsLUStilo8dM z7vEO9QniUpenhqke4Yu7OAsxrxDhKZTzHjX4ogmARFF+krWMiDzrhElckiqGfb2Ax86 uVUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1708206537; x=1708811337; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=XwPBh5qg3OPyT7hp9K78w2DBfOohcZyAqkd2y+Vx9Ag=; b=oaniqGO5fjllrwdlWb8UFfo3mtTZ1ffuQ70Xbjy9QuV57kzeOqn1AnPRzQUId2RnKF //8MnJhfNvcl3j+6y5S7oCTrJ+ivuyuftdAjyoyMwAZ6VUmkETC0WEmvR+uZaLrTS6Dp IN8M1sWxeEQXe+F5j95l4nJHf2WoRnf7wqyIs/Os7OlZYyzyF43Chgk8vYxHATo+a9E1 d3aYAB9TpFnqNhYtJZ8uQa0o4BEqhsjcUWoDuIxU3l9ZLw5LUvkrOm+KijP9m1zL6/LK zSmWWOcHoPSD1jKEE7nVlMNLFAdDGCPWCISAmXEX5rAv+V3DT4sCWqcyf6C7hD8Vsibw OdNg== X-Forwarded-Encrypted: i=1; AJvYcCWCUjIAvRL+DG4wq+I5oI2y8Ld/Q6KRa7ZHe6XkrjkCKZVDGuo1oVO/8WHuKCICAf5seE0iO8KtVvej6gva7NljYOPQytQb+oeMxFK3 X-Gm-Message-State: AOJu0YwkvqJ6dX6MGhpeDRJpUNyLfeYZTG19y/1w4ZaYGzDqFWBhk8r9 eFmF18zZmUK8Gz1ox3dVrbATYuhvQAbe7FG4nPm7EgN0YaZ/1a6HXTx+/G1mdNCLGrHTfKxz0Wl DbatYKp3esJu25vX0Zl05c07tq0s= X-Google-Smtp-Source: AGHT+IEJd2OgDlr11JsGcXE/bnYHpOsGVXlDhwNjvuuGB+7lFeSkpWG3tJkuFWsXG81K1EcS34ORDNyzpXPqKBw6hGY= X-Received: by 2002:a2e:b94f:0:b0:2d2:1a3e:c3aa with SMTP id 15-20020a2eb94f000000b002d21a3ec3aamr2774708ljs.26.1708206536480; Sat, 17 Feb 2024 13:48:56 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Matthias van de Meent Date: Sat, 17 Feb 2024 22:48:44 +0100 Message-ID: Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements To: Michail Nikolaev Cc: Melanie Plageman , PostgreSQL Hackers , Alvaro Herrera Content-Type: text/plain; charset="UTF-8" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, 1 Feb 2024, 17:06 Michail Nikolaev, wrote: > > > > > I just realised there is one issue with this design: We can't cheaply > > > > reset the snapshot during the second table scan: > > > > It is critically important that the second scan of R/CIC uses an index > > > > contents summary (made with index_bulk_delete) that was created while > > > > the current snapshot was already registered. I think the best way for this to work would be an index method that exclusively stores TIDs, and of which we can quickly determine new tuples, too. I was thinking about something like GIN's format, but using (generation number, tid) instead of ([colno, colvalue], tid) as key data for the internal trees, and would be unlogged (because the data wouldn't have to survive a crash). Then we could do something like this for the second table scan phase: 0. index->indisready is set [...] 1. Empty the "changelog index", resetting storage and the generation number. 2. Take index contents snapshot of new index, store this. 3. Loop until completion: 4a. Take visibility snapshot 4b. Update generation number of the changelog index, store this. 4c. Take index snapshot of "changelog index" for data up to the current stored generation number. Not including, because we only need to scan that part of the index that were added before we created our visibility snapshot, i.e. TIDs labeled with generation numbers between the previous iteration's generation number (incl.) and this iteration's generation (excl.). 4d. Combine the current index snapshot with that of the "changelog" index, and save this. Note that this needs to take care to remove duplicates. 4e. Scan segment of table (using the combined index snapshot) until we need to update our visibility snapshot or have scanned the whole table. This should give similar, if not the same, behavour as that which we have when we RIC a table with several small indexes, without requiring us to scan a full index of data several times. Attemp on proving this approach's correctness: In phase 3, after each step 4b: All matching tuples of the table that are in the visibility snapshot: * Were created before scan 1's snapshot, thus in the new index's snapshot, or * Were created after scan 1's snapshot but before index->indisready, thus not in the new index's snapshot, nor in the changelog index, or * Were created after the index was set as indisready, and committed before the previous iteration's visibility snapshot, thus in the combined index snapshot, or * Were created after the index was set as indisready, after the previous visibility snapshot was taken, but before the current visibility snapshot was taken, and thus definitely included in the changelog index. Because we hold a snapshot, no data in the table that we should see is removed, so we don't have a chance of broken HOT chains. Kind regards, Matthias van de Meent Neon (https://neon.tech)