MIME-Version: 1.0
References: <CANtu0oiLc-+7h9zfzOVy2cv2UuYk_5MUReVLnVbOay6OgD_KGg@mail.gmail.com>
 <CAEze2WgW6pj48xJhG_YLUE1QS+n9Yv0AZQwaWeb-r+X=HAxU_g@mail.gmail.com>
 <CANtu0oizNtPUrPB0Mh+2vyjdijTX=LZvO5_dZN3+NqvE-CFPtw@mail.gmail.com>
 <CAEze2Wi3BFLkFBcZ+Brfbr-mGBCcWXcWuHucnCnw5ZOQotc6Eg@mail.gmail.com>
 <CANtu0ojRX=osoiXL9JJG6g6qOowXVbVYX+mDsN+2jmFVe=eG7w@mail.gmail.com>
 <CAEze2Wg03Ps_StwEhgCdSn7VXY9ZUM=zCrf-m1dRZpTWv6wD_A@mail.gmail.com>
 <CANtu0oj66JjAq8xyRSeO=MuRHYS2XsYbhHRRESHtOcLJs=3+Sw@mail.gmail.com>
 <CANtu0ogT2Qn7-q_jK6+DqBQvFoTt69eQJDKxJARXV9pdWjd0Gg@mail.gmail.com>
 <CANtu0ogXgNkEuxbDRwznAZpxEXRmj3NzOen3y-RGHDwig0YBRw@mail.gmail.com>
 <CANtu0oi+FTMqDb+6Bv8w7VHiTFVMB1uAAip_P841WQH+ktPixw@mail.gmail.com>
 <CAEze2WgeyVnDb_j4gJQYC4+HcSsYQAdeRA1-F0KDnJ=Y0A_TzA@mail.gmail.com>
 <CANtu0oga9zqqEFhdmcWyJTK4d6EGMJsMB_LMgVSE8ar0xVm7Ew@mail.gmail.com>
 <CANtu0oirtBK_g4jxtw3jehSop3b0WSQaek5Sv5OGSXwxgcHwZQ@mail.gmail.com>
 <CANtu0oijWPRGRpaRR_OvT2R5YALzscvcOTFh-=uZKUpNJmuZtw@mail.gmail.com>
 <CAEze2WgHFnYdxkNUmvqxOc-cFUNEYaTqL7+Pei=CtA-ZrTOFyw@mail.gmail.com>
 <CANtu0oipL3e8fLnejbH4HnByMW6G_auR4v+ns8j-UHhuPW=9og@mail.gmail.com>
 <CANtu0ojmVw8GW5bJknnqSp7Dp1xEuoBewdu2imtQ2tGnWpiWEg@mail.gmail.com>
 <CAEze2WgNHTWfw_bP6O0zW_=vi1D-yi1nh6-JDj9kd=8UaB-zLA@mail.gmail.com>
 <CANtu0ojA5=rT8BN5==OAiQJZh8CAxD_U8thFhZ3mwrZQ6roNOA@mail.gmail.com>
 <CAEze2Wh3eSAnXFdY_6roNPb3WD-YsKbNLiKf=cPmAGHkPUd22w@mail.gmail.com>
 <CANtu0og_=ypCbH2ZFayn44i=CL0HAXKW390LfZhQ1F56HoFXtQ@mail.gmail.com> <CAEze2WghpUS29bJJh5GCZ+WtpO4qWmxiFF-CTWFiP4Qq62G58w@mail.gmail.com>
In-Reply-To: <CAEze2WghpUS29bJJh5GCZ+WtpO4qWmxiFF-CTWFiP4Qq62G58w@mail.gmail.com>
From: Michail Nikolaev <michail.nikolaev@gmail.com>
Date: Sat, 4 May 2024 17:51:20 +0200
Message-ID: <CANtu0oiuuGRvYRsH-y0iQjfc+JpT9o4mPUXVkz97+sW9BXA+FA@mail.gmail.com>
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
To: Matthias van de Meent <boekewurm+postgres@gmail.com>
Cc: Melanie Plageman <melanieplageman@gmail.com>, 
	PostgreSQL Hackers <pgsql-hackers@postgresql.org>
Content-Type: multipart/alternative; boundary="0000000000006a53050617a2cf7c"
Archived-At: <https://www.postgresql.org/message-id/CANtu0oiuuGRvYRsH-y0iQjfc%2BJpT9o4mPUXVkz97%2BsW9BXA%2BFA%40mail.gmail.com>
Precedence: bulk

--0000000000006a53050617a2cf7c
Content-Type: text/plain; charset="UTF-8"

Hello, Matthias!

> We can just release the current snapshot, and get a new one, right? I
> mean, we don't actually use the transaction for much else than
> visibility during the first scan, and I don't think there is a need
> for an actual transaction ID until we're ready to mark the index entry
> with indisready.

> I suppose we could be resetting the snapshot every so often? Or use
> multiple successive TID range scans with a new snapshot each?

It seems like it is not so easy in that case. Because we still need to hold
the catalog snapshot xmin, releasing the snapshot used for the scan does
not affect the xmin propagated to the horizon.
That's why d9d076222f5b94a85e0e318339cfc44b8f26022d [1] affects only the
data horizon, but not the catalog one.

So, in such a situation, we may:

1) restart the scan from scratch with some TID range multiple times. But
such an approach feels too complex and error-prone to me.

2) split the horizons propagated by `MyProc` into a data-related xmin and a
catalog-related xmin, like `xmin` and `catalogXmin`. We may just mark
snapshots as affecting one of the horizons, or both. Such a change feels
easy to do but touches pretty core logic, so we probably need someone's
approval for such a proposal.

3) provide a less invasive (but more kludgy) way: add some kind of
process flag like `PROC_IN_SAFE_IC_XMIN` and a function like
`AdvanceIndexSafeXmin` which changes the way the backend affects the horizon
calculation. When `PROC_IN_SAFE_IC_XMIN` is set, `ComputeXidHorizons`
uses the value from `proc->safeIcXmin`, which is updated by
`AdvanceIndexSafeXmin` while switching scan snapshots.

So, with option 2 or 3, we may avoid holding the data horizon during the
first phase scan by resetting the scan snapshot every so often (and,
optionally, using `AdvanceIndexSafeXmin` in the case of the 3rd approach).
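
To make options 2 and 3 a bit more concrete, here is a minimal toy sketch in
C of a horizon calculation with the two xmins split. It is not PostgreSQL
code: `ToyProc`, `toy_compute_horizons` and `toy_reset_scan_snapshot` are
hypothetical names invented for illustration, XIDs are compared as plain
integers (no wraparound handling), and replication slots and other inputs
to the real calculation are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;                /* toy stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyXid) 0)

/* Hypothetical per-backend state: the advertised xmin is split in two. */
typedef struct ToyProc
{
    ToyXid xmin;        /* oldest xid this backend needs for user data */
    ToyXid catalogXmin; /* oldest xid this backend needs for catalogs */
} ToyProc;

typedef struct ToyHorizons
{
    ToyXid data_oldest;     /* data rows deleted before this are removable */
    ToyXid catalog_oldest;  /* same, but for catalog rows */
} ToyHorizons;

/* Take the minimum of each advertised xmin; start from the next xid. */
ToyHorizons
toy_compute_horizons(const ToyProc *procs, size_t nprocs, ToyXid next_xid)
{
    ToyHorizons h = {next_xid, next_xid};

    for (size_t i = 0; i < nprocs; i++)
    {
        if (procs[i].xmin != TOY_INVALID_XID && procs[i].xmin < h.data_oldest)
            h.data_oldest = procs[i].xmin;
        if (procs[i].catalogXmin != TOY_INVALID_XID &&
            procs[i].catalogXmin < h.catalog_oldest)
            h.catalog_oldest = procs[i].catalogXmin;
    }
    return h;
}

/* Switching the scan snapshot moves only the data-related xmin forward. */
void
toy_reset_scan_snapshot(ToyProc *proc, ToyXid new_xmin)
{
    assert(new_xmin >= proc->xmin);     /* the data xmin never goes back */
    proc->xmin = new_xmin;
    /* catalogXmin is intentionally left untouched */
}
```

In this model, a backend that calls `toy_reset_scan_snapshot` during a long
scan stops holding back `data_oldest`, while `catalog_oldest` stays where it
was because `catalogXmin` is still advertised.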


The same will be possible for the second phase (validate).

We may use the same "resetting the snapshot every so often" technique, but
there is still the issue of how we distinguish tuples which were missed by
the first phase scan or were inserted into the index after the visibility
snapshot was taken.

So, I see two options here:

1) the approach you proposed, with an additional index using some custom AM.

It looks correct and reliable but feels complex to implement and
maintain. Also, it negatively affects the performance of table access
(because of the additional index) and of the validation scan (because we
need to merge the additional index content with the visibility snapshot).

2) one more tricky approach.

We may add a boolean flag to `Relation` indicating that an index build is
in progress (`indexisbuilding`).

It may be easily calculated using `(index->indisready &&
!index->indisvalid)`. For a more reliable solution, we also need to somehow
check whether the backend/transaction building the index is still in
progress. Also, it is better to check whether the index is being built
concurrently in the "safe_index" way.

I think there is a not too complex or expensive way to do so, probably by
adding some flag to the index catalog record.

Once we have such a flag, we may "legally" prohibit `heap_page_prune_opt`
from affecting the relation by updating `GlobalVisHorizonKindForRel` like
this:

   if (rel != NULL && rel->rd_indexvalid && rel->rd_indexisbuilding)
           return VISHORIZON_CATALOG;

So, in general, it works this way:

* the backend building the index affects the catalog horizon as usual, but
the data horizon is regularly propagated forward during the scan. So, other
relations are processed by vacuum and `heap_page_prune_opt` without any
restrictions

* but our relation (with CIC in progress) is accessed by
`heap_page_prune_opt` (or any other vacuum-like mechanics) with the catalog
horizon to honor the CIC work. Therefore, the validating scan may be sure
that none of the HOT chains will be truncated. Even regular vacuum can't
affect it (but yes, it can't anyway because of relation locking).
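
The two points above can be illustrated with a small toy model in C. It is
only a sketch under assumptions: `ToyRelation`, `index_is_building` (standing
in for the proposed `rd_indexisbuilding`), `toy_horizon_kind_for_rel` and
`toy_can_prune` are hypothetical names, shared and temporary relations are
left out, and XIDs are compared as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;                /* toy stand-in for TransactionId */

typedef enum ToyHorizonKind
{
    TOY_VISHORIZON_CATALOG,
    TOY_VISHORIZON_DATA
} ToyHorizonKind;

typedef struct ToyRelation
{
    bool is_catalog;         /* system catalog relation */
    bool index_is_building;  /* hypothetical flag: CIC in progress on it */
} ToyRelation;

typedef struct ToyHorizons
{
    ToyXid data_oldest;      /* advanced as the CIC scan resets snapshots */
    ToyXid catalog_oldest;   /* still held back by the CIC backend */
} ToyHorizons;

/* A relation with an index build in progress is treated like a catalog. */
ToyHorizonKind
toy_horizon_kind_for_rel(const ToyRelation *rel)
{
    assert(rel != NULL);
    if (rel->is_catalog || rel->index_is_building)
        return TOY_VISHORIZON_CATALOG;
    return TOY_VISHORIZON_DATA;
}

/* A dead tuple is prunable only if its deleter precedes the horizon. */
bool
toy_can_prune(const ToyRelation *rel, const ToyHorizons *h, ToyXid deleter_xid)
{
    ToyXid limit = (toy_horizon_kind_for_rel(rel) == TOY_VISHORIZON_CATALOG)
        ? h->catalog_oldest
        : h->data_oldest;

    return deleter_xid < limit;
}
```

With `data_oldest = 150` and `catalog_oldest = 100`, a tuple deleted by xid
120 is prunable in an unrelated table but is kept in the table whose index
is being built.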

As a result, we may easily distinguish tuples missed by the first phase
scan, just by testing them against the reference snapshot (the one which is
used to take the visibility snapshot).
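
As a rough illustration of that test, here is a toy sketch in C. It is not
the prototype's code: `ToySnapshot`, `ToyTuple`, the `in_index` field and
`toy_needs_index_insert` are hypothetical, every finished transaction is
assumed to have committed, and XIDs are compared as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;                /* toy stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyXid) 0)

typedef struct ToySnapshot
{
    ToyXid        xmin;  /* every xid below this had already finished */
    ToyXid        xmax;  /* every xid at or above this had not started */
    const ToyXid *xip;   /* xids in [xmin, xmax) that were still running */
    size_t        xcnt;
} ToySnapshot;

typedef struct ToyTuple
{
    ToyXid xmin;      /* inserting xid */
    ToyXid xmax;      /* deleting xid, or TOY_INVALID_XID if live */
    bool   in_index;  /* hypothetical: the new index already has its TID */
} ToyTuple;

/* True if xid had not finished when the snapshot was taken. */
static bool
toy_xid_in_progress(ToyXid xid, const ToySnapshot *snap)
{
    if (xid >= snap->xmax)
        return true;
    if (xid < snap->xmin)
        return false;
    for (size_t i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}

/* Simplified MVCC check: inserter finished, deleter (if any) had not. */
bool
toy_visible(const ToyTuple *tup, const ToySnapshot *snap)
{
    assert(tup->xmin != TOY_INVALID_XID);
    if (toy_xid_in_progress(tup->xmin, snap))
        return false;
    if (tup->xmax != TOY_INVALID_XID && !toy_xid_in_progress(tup->xmax, snap))
        return false;
    return true;
}

/* Visible to the reference snapshot but absent from the index: missed. */
bool
toy_needs_index_insert(const ToyTuple *tup, const ToySnapshot *ref)
{
    return toy_visible(tup, ref) && !tup->in_index;
}
```

The sketch only works if no HOT chain on the relation is pruned past the
reference snapshot, which is what treating the relation with the catalog
horizon is meant to guarantee.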

So, for me, this approach feels non-kludge enough, and safe and effective
at the same time.

I have a prototype of this approach and it looks like it works (I have a
good test catching issues with index content for CIC).

What do you think about all this?

[1]:
https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793

--0000000000006a53050617a2cf7c--