MIME-Version: 1.0
References: <CANtu0oiLc-+7h9zfzOVy2cv2UuYk_5MUReVLnVbOay6OgD_KGg@mail.gmail.com>
 <CAEze2WgW6pj48xJhG_YLUE1QS+n9Yv0AZQwaWeb-r+X=HAxU_g@mail.gmail.com> <CANtu0oizNtPUrPB0Mh+2vyjdijTX=LZvO5_dZN3+NqvE-CFPtw@mail.gmail.com>
In-Reply-To: <CANtu0oizNtPUrPB0Mh+2vyjdijTX=LZvO5_dZN3+NqvE-CFPtw@mail.gmail.com>
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Mon, 18 Dec 2023 00:53:34 +0100
Message-ID: <CAEze2Wi3BFLkFBcZ+Brfbr-mGBCcWXcWuHucnCnw5ZOQotc6Eg@mail.gmail.com>
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
To: Michail Nikolaev <michail.nikolaev@gmail.com>
Cc: PostgreSQL Hackers <pgsql-hackers@postgresql.org>, Alvaro Herrera <alvherre@2ndquadrant.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CAEze2Wi3BFLkFBcZ%2BBrfbr-mGBCcWXcWuHucnCnw5ZOQotc6Eg%40mail.gmail.com>
Precedence: bulk

On Sun, 17 Dec 2023, 21:14 Michail Nikolaev, <michail.nikolaev@gmail.com> w=
rote:
>
> Hello!
>
> > I've thought about alternative solutions, too: how about getting a new =
snapshot every so often?
> > We don't really care about the liveness of the already-scanned data; th=
e snapshots used for RIC
> > are used only during the scan. C/RIC's relation's lock level means vacu=
um can't run to clean up
> > dead line items, so as long as we only swap the backend's reported snap=
shot (thus xmin) while
> > the scan is between pages we should be able to reduce the time C/RIC is=
 the one backend
> > holding back cleanup of old tuples.
>
> Hm, it looks like an interesting idea! It may be more dangerous, but
> at least it feels much more elegant than an LP_DEAD-related way.
> Also, feels like we may apply this to both phases (first and the second s=
cans).
> The original patch (1) was helping only to the second one (after call
> to set_indexsafe_procflags).
>
> But for the first scan we allowed to do so only for non-unique indexes
> because of:
>
> > * The reason for doing that is to avoid
> > * bogus unique-index failures due to concurrent UPDATEs (we might see
> > * different versions of the same row as being valid when we pass over t=
hem,
> > * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index t=
hat
> > * does not contain any tuples added to the table while we built the ind=
ex.

Yes, for that we'd need an extra scan of the index that validates
uniqueness. I think there was a proposal (though it may only have been
an idea) some time ago, about turning existing non-unique indexes into
unique ones by validating the data. Such a system would likely be very
useful to enable this optimization.

> Also, (1) was limited to indexes without expressions and predicates
> (2) because such may execute queries to other tables (sic!).

Note that the use of such expressions would be a violation of the
function's definition; it would depend on data from other tables which
makes the function behave like a STABLE function, as opposed to the
IMMUTABLE that is required for index expressions. So, I don't think we
should specially care about being correct for incorrectly marked
function definitions.

> One possible solution is to add some checks to make sure no
> user-defined functions are used.
> But as far as I understand, it affects only CIC for now and does not
> affect the ability to use the proposed technique (updating snapshot
> time to time).
>
> However, I think we need some more-less formal proof it is safe - it
> is really challenging to keep all the possible cases in the head. I=E2=80=
=99ll
> try to do something here.

I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered. If we didn't do that, the
following would occur:

1. The index is marked ready for inserts from concurrent backends, but
not yet ready for queries.
2. We get the bulkdelete data
3. A concurrent backend inserts a new tuple T on heap page P, inserts
it into the index, and commits. This tuple is not in the summary, but
has been inserted into the index.
4. R/CIC resets the snapshot, making T visible.
5. R/CIC scans page P, finds that tuple T has to be indexed but is not
present in the summary, thus inserts that tuple into the index (which
already had it inserted at 3)

This thus would be a logic bug, as indexes assume at-most-once
semantics for index tuple insertion; duplicate insertion are an error.

So, the "reset the snapshot every so often" trick cannot be applied in
phase 3 (the rescan), or we'd have to do an index_bulk_delete call
every time we reset the snapshot. Rescanning might be worth the cost
(e.g. when using BRIN), but that is very unlikely.

Alternatively, we'd need to find another way to prevent us from
inserting these duplicate entries - maybe by storing the scan's data
in a buffer to later load into the index after another
index_bulk_delete()? Counterpoint: for BRIN indexes that'd likely
require a buffer much larger than the result index would be.

Either way, for the first scan (i.e. phase 2 "build new indexes") this
is not an issue: we don't care about what transaction adds/deletes
tuples at that point.
For all we know, all tuples of the table may be deleted concurrently
before we even allow concurrent backends to start inserting tuples,
and the algorithm would still work as it does right now.

> Another possible issue may be caused by the new locking pattern - we
> will be required to wait for all transaction started before the ending
> of the phase to exit.

What do you mean by "new locking pattern"? We already keep an
ShareUpdateExclusiveLock on every heap table we're accessing during
R/CIC, and that should already prevent any concurrent VACUUM
operations, right?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)