Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s3Hfq-00BOwy-PK for pgsql-hackers@arkaria.postgresql.org; Sat, 04 May 2024 15:51:42 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1s3Hfl-000f8a-Go for pgsql-hackers@arkaria.postgresql.org; Sat, 04 May 2024 15:51:38 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s3Hfl-000f8S-0X for pgsql-hackers@lists.postgresql.org; Sat, 04 May 2024 15:51:38 +0000 Received: from mail-ej1-x62c.google.com ([2a00:1450:4864:20::62c]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1s3Hfi-001QQH-SH for pgsql-hackers@postgresql.org; Sat, 04 May 2024 15:51:36 +0000 Received: by mail-ej1-x62c.google.com with SMTP id a640c23a62f3a-a58c89bda70so143137566b.3 for ; Sat, 04 May 2024 08:51:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1714837892; x=1715442692; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=jgj4uE2+W2FgeAloyYkpgCaazStUovxPd5fDD97peQQ=; b=CAM3iRjA19GhYDET+6Bzssy9Kye5uHUkW1zH0AA7jCotIbqSRuwAhJW8d2/mku77vE GUb4IO1d1PkCo5K37MAZrup58J4+FdBfx9QPzthGQ9WOvU7tGKQ9ZPBTyBrGPwFTDq60 +laVJXIQ06WYKe8j6rKo4SVscegE+2kaXklfO8tYsYt4iYGFn7VnOysuwNwkZjz42GH2 t0UTwR5LhoORqbIfr0ceoM73t7Ik8zQ/PLgw27fziEI2DJ5MohPF2htvu3QLNMMO/B9q q9Wnd8a2i4fXX/U//fFFQV0UcVcQS1FYj8roqIgylR7Vl4xO00GaA1bslvwVfEQ7wMsd llaA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1714837892; x=1715442692; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=jgj4uE2+W2FgeAloyYkpgCaazStUovxPd5fDD97peQQ=; b=O94xaI0iqXyHJjf9pyqu9oFvWnGE/NhEhe0Y99+QdTnaoKc4NCVEbEdeZt+Uskhtjo IeOu8UwdZRj6/YyqMLjtmk/kKUfb+uxot6+S7GncTENklM662rIM8wIFEy4o4h+LBOGp S5pRkZhEYok8EMvP85Rb5BKAe4Aa4w4es/GUecLijYgKS1FwloXSSWc6eXv6Ye9H8Qaw QLfpIeOsE0MgRiWlCNAgeobAHAs/YopAFFIMbhGPWJ8uoyMXHcjPUsgWSzU13kWpwuX1 YLn+m3bG7tUZOr15sA4oSHQKWRn1P48Q+OiR8OtFTAnEKFOtPhOsesuwTAZ8oCdR2uj0 RzXQ== X-Forwarded-Encrypted: i=1; AJvYcCX2suZZCN386IYbNimGqUCM0K1S62jHKiNwOGMSBzGagagAlMXo2ssFGJdwB0Z/c2lJeR6S3+QimLJP44EP25FtNMDGg2he0y1sajfT X-Gm-Message-State: AOJu0YwDSNhqcezKhUGisAefBGQ4aYO3E8E/7CjnhyUHIiUein7SmCLp hlzfTtgefZrtHZeFeILJ1LZIPtUK2o4XistL3XbQ3ebDU+OrMLFiuCS4MeCQU4vltx0/dq4TYtX X4b0uEq09pCN44Gxpaj/aASXFadvJJ1nU X-Google-Smtp-Source: AGHT+IE9b/K9F7mFelhmFZY6+6M21Y0jx9mXH9jEgKpwvSYTXTSXX0GDe2oqOyKwCxzXgJLWSgQS9j/3CrgtwWs9bPQ= X-Received: by 2002:a50:cd93:0:b0:572:8aab:441c with SMTP id p19-20020a50cd93000000b005728aab441cmr3588565edi.26.1714837892254; Sat, 04 May 2024 08:51:32 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Michail Nikolaev Date: Sat, 4 May 2024 17:51:20 +0200 Message-ID: Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements To: Matthias van de Meent Cc: Melanie Plageman , PostgreSQL Hackers Content-Type: multipart/alternative; boundary="0000000000006a53050617a2cf7c" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000006a53050617a2cf7c Content-Type: text/plain; charset="UTF-8" Hello, Matthias! > We can just release the current snapshot, and get a new one, right? I > mean, we don't actually use the transaction for much else than > visibility during the first scan, and I don't think there is a need > for an actual transaction ID until we're ready to mark the index entry > with indisready. > I suppose we could be resetting the snapshot every so often? Or use > multiple successive TID range scans with a new snapshot each? It seems like it is not so easy in that case. Because we still need to hold catalog snapshot xmin, releasing the snapshot which used for the scan does not affect xmin propagated to the horizon. That's why d9d076222f5b94a85e0e318339cfc44b8f26022d(1) affects only the data horizon, but not the catalog's one. So, in such a situation, we may: 1) starts scan from scratch with some TID range multiple times. But such an approach feels too complex and error-prone for me. 2) split horizons propagated by `MyProc` to data-related xmin and catalog-related xmin. Like `xmin` and `catalogXmin`. We may just mark snapshots as affecting some of the horizons, or both. Such a change feels easy to be done but touches pretty core logic, so we need someone's approval for such a proposal, probably. 3) provide some less invasive (but less non-kludge) way: add some kind of process flag like `PROC_IN_SAFE_IC_XMIN` and function like `AdvanceIndexSafeXmin` which changes the way backend affect horizon calculation. In the case of `PROC_IN_SAFE_IC_XMIN` `ComputeXidHorizons` uses value from `proc->safeIcXmin` which is updated by `AdvanceIndexSafeXmin` while switching scan snapshots. So, with option 2 or 3, we may avoid holding data horizon during the first phase scan by resetting the scan snapshot every so often (and, optionally, using `AdvanceIndexSafeXmin` in case of 3rd approach). The same will be possible for the second phase (validate). We may do the same "resetting the snapshot every so often" technique, but there is still the issue with the way we distinguish tuples which were missed by the first phase scan or were inserted into the index after the visibility snapshot was taken. So, I see two options here: 1) approach with additional index with some custom AM proposed by you. It looks correct and reliable but feels complex to implement and maintain. Also, it negatively affects performance of table access (because of an additional index) and validation scan (because we need to merge additional index content with visibility snapshot). 2) one more tricky approach. We may add some boolean flag to `Relation` about information of index building in progress (`indexisbuilding`). It may be easily calculated using `(index->indisready && !index->indisvalid)`. For a more reliable solution, we also need to somehow check if backend/transaction building the index still in progress. Also, it is better to check if index is building concurrently using the "safe_index" way. I think there is a non too complex and expensive way to do so, probably by addition of some flag to index catalog record. Once we have such a flag, we may "legally" prohibit `heap_page_prune_opt` affecting the relation updating `GlobalVisHorizonKindForRel` like this: if (rel != NULL && rel->rd_indexvalid && rel->rd_indexisbuilding) return VISHORIZON_CATALOG; So, in common it works this way: * backend building the index affects catalog horizon as usual, but data horizon is regularly propagated forward during the scan. So, other relations are processed by vacuum and `heap_page_prune_opt` without any restrictions * but our relation (with CIC in progress) accessed by `heap_page_prune_opt` (or any other vacuum-like mechanics) with catalog horizon to honor CIC work. Therefore, validating scan may be sure what none of the HOT-chain will be truncated. Even regular vacuum can't affect it (but yes, it can't be anyway because of relation locking). As a result, we may easily distinguish tuples missed by first phase scan, just by testing them against reference snapshot (which used to take visibility snapshot). So, for me, this approach feels non-kludge enough, safe and effective and the same time. I have a prototype of this approach and looks like it works (I have a good test catching issues with index content for CIC). What do you think about all this? [1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793 --0000000000006a53050617a2cf7c Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello, Matthias!

> We can just release the curre= nt snapshot, and get a new one, right? I
> mean, we don't actuall= y use the transaction for much else than
> visibility during the firs= t scan, and I don't think there is a need
> for an actual transac= tion ID until we're ready to mark the index entry
> with indisrea= dy.

> I suppose we could be resetting the snapshot every so often= ? Or use
> multiple successive TID range scans with a new snapshot ea= ch?

It seems like it is not so easy in that case. Because we still n= eed to hold catalog snapshot xmin, releasing the snapshot which used for th= e scan does not affect xmin propagated to the horizon.
That's why d9= d076222f5b94a85e0e318339cfc44b8f26022d(1) affects only the data horizon, bu= t not the catalog's one.

So, in such a situation, we may:
1) starts scan from scratch with some TID range multiple times. But such a= n approach=C2=A0feels too complex and error-prone for me.

2) split h= orizons propagated by `MyProc` to data-related xmin and catalog-related xmi= n. Like `xmin` and `catalogXmin`. We may just mark snapshots as affecting s= ome of the horizons, or both. Such a change feels easy to be done but touch= es pretty core logic, so we need someone's approval for such a proposal= , probably.

3) provide some less invasive (but less non-kludge) way:= add some kind of process flag like `PROC_IN_SAFE_IC_XMIN` and function lik= e `AdvanceIndexSafeXmin` which changes the way backend affect horizon calcu= lation. In the case of `PROC_IN_SAFE_IC_XMIN` `ComputeXidHorizons` uses val= ue from `proc->safeIcXmin` which is updated by `AdvanceIndexSafeXmin` wh= ile switching scan snapshots.

So, with option 2 or 3, we may avoid h= olding data horizon during the first phase scan by resetting the scan snaps= hot every so often (and, optionally, using `AdvanceIndexSafeXmin` in case o= f 3rd approach).


The same will be possible for the second phase = (validate).

We may do the same "resetting the snapshot every so= often" technique, but there is still the issue with the way we distin= guish tuples which were missed by the first phase scan or were inserted int= o the index after the visibility snapshot was taken.

So, I see two o= ptions here:

1) approach with additional index with some custom AM p= roposed by you.

=C2=A0 =C2=A0It looks correct and reliable but = feels complex to implement and maintain. Also, it negatively affects perfor= mance of table access (because of an additional index) and validation scan = (because we need to merge additional index content with visibility snapshot= ).

2) one more tricky approach.
=C2=A0
We may add some= boolean flag to `Relation` about information of index building in progress= (`indexisbuilding`).

It may be easily calculated using `(index->= indisready && !index->indisvalid)`. For a more reliable solution= , we also need to somehow check if backend/transaction building the index s= till in progress. Also, it is better to check if index is building concurre= ntly using the "safe_index" way.

I think there is a non to= o complex and expensive way to do so, probably by addition of some flag to = index catalog record.

Once we have such a flag, we may "legally= " prohibit `heap_page_prune_opt` affecting the relation updating `Glob= alVisHorizonKindForRel` like this:

=C2=A0 =C2=A0if (rel !=3D NULL &a= mp;& rel->rd_indexvalid && rel->rd_indexisbuilding)
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return VISHORIZON_CATALOG;

= So, in common it works this way:

* backend building the index affect= s catalog horizon as usual, but data horizon is regularly propagated forwar= d during the scan. So, other relations are processed by vacuum and `heap_pa= ge_prune_opt` without any restrictions

* but our relation (with CIC = in progress) accessed by `heap_page_prune_opt` (or any other vacuum-like me= chanics) with catalog horizon to honor CIC work. Therefore, validating scan= may be sure what none of the HOT-chain will be truncated. Even regular vac= uum can't affect it (but yes, it can't be anyway because of relatio= n locking).

As a result, we may easily distinguish tuples missed by = first phase scan, just by testing them against reference snapshot (which us= ed to take visibility snapshot).

So, for me, this approach feels non= -kludge enough, safe and effective and the same time.

I have a proto= type of this approach and looks like it works (I have a good test catching = issues with index content for CIC).

What do you think about all this= ?

[1]: https://github.com/postgres/postg= res/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070a= b7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793
--0000000000006a53050617a2cf7c--