Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rF0xK-0082OA-Dn for pgsql-hackers@arkaria.postgresql.org; Sun, 17 Dec 2023 23:53:58 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rF0xI-004XGP-PZ for pgsql-hackers@arkaria.postgresql.org; Sun, 17 Dec 2023 23:53:56 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rF0xI-004XGG-7z for pgsql-hackers@lists.postgresql.org; Sun, 17 Dec 2023 23:53:56 +0000 Received: from mail-lf1-x133.google.com ([2a00:1450:4864:20::133]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rF0xB-00Auad-4D for pgsql-hackers@postgresql.org; Sun, 17 Dec 2023 23:53:54 +0000 Received: by mail-lf1-x133.google.com with SMTP id 2adb3069b0e04-50e39ac39bcso222781e87.3 for ; Sun, 17 Dec 2023 15:53:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1702857226; x=1703462026; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=A48LHgP63VPLgwHh2kQZVtjWS04Ua7ko6sces6ap3SY=; b=kmO87NJg90jiYfomiCRm/D221+8nNTL4kav92YcCyXu5Wn2NKyZs801/0qQKSWcTMF 95r6UoLJKr1SjTha+LdD0LPbGnyirx7fhE3/+e80x0rZnMV3io885f7JtnKh7ujX5iWV BpUCgOh71dtkqGa1NvEXc/O1rpf7yQj9mBhscawgkGjTe7MCApSqA+UTG7Gtn6kjHqZO aapXTfytWZerxmUfblA+xkX8C0NQpeUbGGPDAQimJVLLBU7c+rlWABYB4YJk2+x0Ea0S PCX0UTyxh2l00AbgU9nC3lHn/uMn5PmgFrbenYBtp4NdnWLZczWG4NJz1OytbE9FbdWo PrZQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702857226; x=1703462026; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=A48LHgP63VPLgwHh2kQZVtjWS04Ua7ko6sces6ap3SY=; b=SZLfB6sFFXl9GZQNqBmtjMo2J3Rhe4OisVFuN/2R9yqUtzKpYEBOczAG9tcb7LzhTv zm3v5Y0cRw2Jsm9vm7V2D5OovcgmPE/ePS1N1i7xuixDSCDyJ6QrUr4DZEdzBtMUVoXy j7i40LB3p9TtZ3nAtxzEYCm7Kr3avg+/RpJG26IWckGz7k6GS9cJC6N7MKfjUX3Pf4AV 4m6dt0jOd03clMHFNB4mJvK5m91Amt5vmoxvi0Y6gztC+F9mEMBrUQaLnN4G5IyT+BuK 1kqZ7fUU0wcKbEPswgI0n55hHGZUSWqIezS8riQnZOdMxWfJrHRLfUJsf7oDsrJiJFMf u43A== X-Gm-Message-State: AOJu0YzwOqvoKcFT+Hxe3n7F3PprrXrZWFRWxKczkBmXoPJ87ZHkv7KC P9wo7B+iC6QIJwzy+jruSikwWTmLkeNxAAW2gw0= X-Google-Smtp-Source: AGHT+IFEgWb6tBPqLOloWm94fupzWjwi+pqbRYBl6om5rl6gGQqFtitaEFfcix9v6B8JbID2fcmJPOwp4np+/2rVcPg= X-Received: by 2002:a2e:ab09:0:b0:2cc:1ea3:6855 with SMTP id ce9-20020a2eab09000000b002cc1ea36855mr4594644ljb.75.1702857225739; Sun, 17 Dec 2023 15:53:45 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Matthias van de Meent Date: Mon, 18 Dec 2023 00:53:34 +0100 Message-ID: Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements To: Michail Nikolaev Cc: PostgreSQL Hackers , Alvaro Herrera Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Sun, 17 Dec 2023, 21:14 Michail Nikolaev, w= rote: > > Hello! > > > I've thought about alternative solutions, too: how about getting a new = snapshot every so often? > > We don't really care about the liveness of the already-scanned data; th= e snapshots used for RIC > > are used only during the scan. C/RIC's relation's lock level means vacu= um can't run to clean up > > dead line items, so as long as we only swap the backend's reported snap= shot (thus xmin) while > > the scan is between pages we should be able to reduce the time C/RIC is= the one backend > > holding back cleanup of old tuples. > > Hm, it looks like an interesting idea! It may be more dangerous, but > at least it feels much more elegant than an LP_DEAD-related way. > Also, feels like we may apply this to both phases (first and the second s= cans). > The original patch (1) was helping only to the second one (after call > to set_indexsafe_procflags). > > But for the first scan we allowed to do so only for non-unique indexes > because of: > > > * The reason for doing that is to avoid > > * bogus unique-index failures due to concurrent UPDATEs (we might see > > * different versions of the same row as being valid when we pass over t= hem, > > * if we used HeapTupleSatisfiesVacuum). This leaves us with an index t= hat > > * does not contain any tuples added to the table while we built the ind= ex. Yes, for that we'd need an extra scan of the index that validates uniqueness. I think there was a proposal (though it may only have been an idea) some time ago, about turning existing non-unique indexes into unique ones by validating the data. Such a system would likely be very useful to enable this optimization. > Also, (1) was limited to indexes without expressions and predicates > (2) because such may execute queries to other tables (sic!). Note that the use of such expressions would be a violation of the function's definition; it would depend on data from other tables which makes the function behave like a STABLE function, as opposed to the IMMUTABLE that is required for index expressions. So, I don't think we should specially care about being correct for incorrectly marked function definitions. > One possible solution is to add some checks to make sure no > user-defined functions are used. > But as far as I understand, it affects only CIC for now and does not > affect the ability to use the proposed technique (updating snapshot > time to time). > > However, I think we need some more-less formal proof it is safe - it > is really challenging to keep all the possible cases in the head. I=E2=80= =99ll > try to do something here. I just realised there is one issue with this design: We can't cheaply reset the snapshot during the second table scan: It is critically important that the second scan of R/CIC uses an index contents summary (made with index_bulk_delete) that was created while the current snapshot was already registered. If we didn't do that, the following would occur: 1. The index is marked ready for inserts from concurrent backends, but not yet ready for queries. 2. We get the bulkdelete data 3. A concurrent backend inserts a new tuple T on heap page P, inserts it into the index, and commits. This tuple is not in the summary, but has been inserted into the index. 4. R/CIC resets the snapshot, making T visible. 5. R/CIC scans page P, finds that tuple T has to be indexed but is not present in the summary, thus inserts that tuple into the index (which already had it inserted at 3) This thus would be a logic bug, as indexes assume at-most-once semantics for index tuple insertion; duplicate insertion are an error. So, the "reset the snapshot every so often" trick cannot be applied in phase 3 (the rescan), or we'd have to do an index_bulk_delete call every time we reset the snapshot. Rescanning might be worth the cost (e.g. when using BRIN), but that is very unlikely. Alternatively, we'd need to find another way to prevent us from inserting these duplicate entries - maybe by storing the scan's data in a buffer to later load into the index after another index_bulk_delete()? Counterpoint: for BRIN indexes that'd likely require a buffer much larger than the result index would be. Either way, for the first scan (i.e. phase 2 "build new indexes") this is not an issue: we don't care about what transaction adds/deletes tuples at that point. For all we know, all tuples of the table may be deleted concurrently before we even allow concurrent backends to start inserting tuples, and the algorithm would still work as it does right now. > Another possible issue may be caused by the new locking pattern - we > will be required to wait for all transaction started before the ending > of the phase to exit. What do you mean by "new locking pattern"? We already keep an ShareUpdateExclusiveLock on every heap table we're accessing during R/CIC, and that should already prevent any concurrent VACUUM operations, right? Kind regards, Matthias van de Meent Neon (https://neon.tech)