MIME-Version: 1.0
References: 
 <CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com>
 <CA+TgmoYjcBA6dw3nwiyfDzPXTCrxTZPXDMrc2TrDJcL1cPK6iA@mail.gmail.com>
 <CA+TgmoYru-vxoTKfwjQby30r2OkTXfb18Km_=VLs6qk8Akr0-g@mail.gmail.com>
 <CA+Tgmoau7yJtvbeH-0kPt1Q=Gt_ezRdgM35Q1=LT665U_86Etg@mail.gmail.com>
 <1299934.1773938807@sss.pgh.pa.us>
 <CA+TgmobOLrMn5jEinWNPL5SrDH1DPpo3a4j+S6-4yhsZwWgzLg@mail.gmail.com>
 <CA+TgmoZUN8FT1Ah=m6Uis5bHa4FUa+_hMDWtcABG17toEfpiUg@mail.gmail.com>
 <CA+TgmoYh2-kM+tscOz=jVYq9Tf4SRPVqzPojs3KLZcW6E9m1BQ@mail.gmail.com>
 <CA+TgmoaK=4w7-qknUo3QhUJ53pXZq=c=KgZmRyD+k7ytqfmgSg@mail.gmail.com>
 <CAP53Pkz3DSFaaowYvbO5LULf3NhydD_UhHkighfWf6_pwxiqUw@mail.gmail.com>
 <CA+TgmoZ45n5jaNKKgbbj4-kYV8WsPvUn=Z8HnoZ7tUb_p9WKXg@mail.gmail.com>
 <CA+TgmoYuWmN-00Ec5pY7zAcpSFQUQLbgAdVWGR9kOR-HM-fHrA@mail.gmail.com>
 <CAP53Pkzn_wZ-R-cPdD9XSQ9+myPUUsPMMqVBPNG3XWXhgfm1-Q@mail.gmail.com>
 <CA+Tgmobxbju8PrY_NULtPr7b7UShp4+Jqibm2Bou8TVS69gObQ@mail.gmail.com>
 <CA+TgmoZpQDJOz_W34Wkp-JA=MQpzLeV6dsDGt=04U0AD6c65RA@mail.gmail.com>
 <3683430.1775173413@sss.pgh.pa.us>
 <CA+TgmoazotFpQfGL=eTAZsFQd-_W8nVOG4oDyx5PFkDpH3_Pzg@mail.gmail.com>
In-Reply-To: 
 <CA+TgmoazotFpQfGL=eTAZsFQd-_W8nVOG4oDyx5PFkDpH3_Pzg@mail.gmail.com>
From: Robert Haas <robertmhaas@gmail.com>
Date: Fri, 3 Apr 2026 13:13:47 -0400
Message-ID: 
 <CA+TgmoadkuOMJjvYe2h6aznHFeePprGEQ8CgUpRK=47sB6DMAg@mail.gmail.com>
Subject: Re: pg_plan_advice
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Alexander Lakhin <exclusion@gmail.com>, Lukas Fittl <lukas@fittl.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2BTgmoadkuOMJjvYe2h6aznHFeePprGEQ8CgUpRK%3D47sB6DMAg%40mail.gmail.com>
Precedence: bulk

On Thu, Apr 2, 2026 at 10:08=E2=80=AFPM Robert Haas <robertmhaas@gmail.com>=
 wrote:
> I'm not sure, either, and I agree that we have a problem. I'll give it
> some more thought tomorrow.

OK, here are my thoughts.

I don't believe it's viable to change pg_plan_advice in such a way
that it won't run into trouble in cases like this.  Somebody could
argue that the choice of INDEX_SCAN(table_name index_name) was bad
design, and that I should have done something like
INDEX_SCAN(table_name indexed_columns) instead, and that might be
true. There might also be an argument that we should have both things
with different spellings, and that might also very well be true. But
we don't really know that changing that design decision would fully
stabilize test_plan_advice. The regression tests can do anything they
like, as long as they reliably pass. It now seems optimistic to me to
suppose that an index with a different name is the only current or
future issue we'll ever have. I mean, if the table were small enough
not to care about whether an index scan or a sequential scan is used,
you could concurrently drop the one and only index altogether, and
what's test_plan_advice supposed to do about that?

So, I argue that there are three possible categories of solutions
here: (1) don't let the problem cases happen in the first place, (2)
detect that a problem has happened, or (3) give up on
test_plan_advice.

In category (1), the simplest idea would be (1a) to run the tests
serially. That would probably involve running them much less often,
like in one of the CI builds but not all of them or something like
that. Another idea that I had is to (1b) try to take stronger locks on
the relations involved to prevent concurrent DDL on them, like a
ShareUpdateExclusiveLock, or (1c) some kind of bespoke interlock
specific to test_plan_advice. I think that might cause random breakage
of other types, though. Another idea in this category is to try to
make the main regression tests "pg_plan_advice clean". I know Tom
already expressed opposition to that idea, but here me out: we could
(1d) have a separate test suite that still does stuff like this, so we
don't lose test coverage, and move some stuff there. Or, instead of
completely separating it, we could (1e) have two schedule files, one
of which includes all the tests and a second of which includes only
the tests that are test_plan_advice-clean. Although my theory that the
main regression tests couldn't have multiple different sessions
simultaneously doing DDL on the same objects has been proven wrong,
I'd still be willing to bet that it's a minority position. Of course,
as Tom pointed out, there could be a "long tail" of failures here, but
maybe we could create some throwaway infrastructure to help figure
that out. For example if we're mostly worried about tables, we could
have each backend accumulate a list of table OIDs that it touched and
spit that out into the log file when it exits. That wouldn't be
committable code, but it would be enough to let us run the regression
tests with that once and see what overlaps exist. I bet there's very
low risk of newly-added tests adding more such cases: the ones that we
have are probably ancient. Of course, maybe I'm wrong about that, too,
but it's a theory that we can discuss.

In category (2), what if, (2a) whenever we see advice feedback that
we'd otherwise print, we try replanning the query a THIRD time without
any supplied advice? If we generate different advice than we did the
first time we planned it, then we know for sure that something is
unstable, and we can decide not to complain about whatever went wrong.
This isn't completely guaranteed to work, though: what if concurrent
DDL changes something between planning cycle 1 and planning cycle 2
and then changes it back before planning cycle 3? But maybe it would
be acceptable to make a rule that the main regression test case
shouldn't do that, and adjust cases that currently do to work
otherwise. If we're not willing to make any rules at all to prevent
the main regression test suite from sabotaging test_plan_advice, then
it's probably doomed. And, I think there's a reasonable argument that
insisting that the main regression test suite absolutely has to change
the definition of an object in a way that test_plan_advice will care
about and then change it back to exactly the initial state while in a
concurrent session some other backend is running queries against that
object is tantamount to legislating deliberate sabotage. But that
said, this proposal has some other imperfections as well. In
particular, a bug that caused the third planning cycle to always
produce different results than the first would hide all future
problems that test_plan_advice might have caught, which is pretty sad.
Another variant of the same basic idea is to (2b) just detect when
we've seen any shared invalidations between the start of the first
planning cycle and the end of the second, and go "never mind, don't
complain even if we saw a problem". The problem with this idea is
that, as in the previous proposal, it might make the tests too
insensitive to real issues. But I wonder if this might be fixable.
Maybe we could (2c) make test_plan_advice take planner_hook and wrap a
loop around the problem: it just keeps replanning the query via
standard_planner (which would eventually reach
test_plan_advice_advisor) until no sinval messages are absorbed
between the start and end of planner, which I think we could detect
using SharedInvalidMessageCounter, or until some retry limit is
exhausted and we error out. I'd need to try this and see how well it
works out in practice, and how often the retry is actually hit, but it
seems like it might be somewhat viable.

In category (3), the most blunt option is obviously just (3a) throw
test_plan_advice away, which I think is probably dooming
pg_plan_advice to getting silently broken in the future. I don't
really have any other ideas in this category except for (1a) already
mentioned, which is sort of a hybrid solution.

My current thought is to do some research into (1e) and (2c).
Specifically, for (1e), I want to try to figure out if this is the
only case of this type or if there are lots of others, since that
seems likely to have a pretty large bearing on what is realistic here.
And for (2c), I think I just want to try it out and see if it seems at
all feasible. Probably obviously, this is not going to happen before
next week, but I hope that the frequency of buildfarm failures is now
low enough that this isn't a critical issue. If that's wrong, let me
know, but from my point of view, even if we eventually chose (3a),
having a good a sense as possible of what the potential failure modes
are here would help to design the next solution, and AFAIK this is the
first failure we've seen since the DO_NOT_SCAN stuff went in.

(In fact, I had a little bit of trouble finding this in the BF results
even knowing it was there: filtering by test_plan_advice failures
doesn't find anything recent. sifaka's failure shows up as
TestModulesCheck-en_US.UTF-8, but frustratingly, the names for the
stage logs don't seem to quite match the name of what failed. There is
testmodules-install-check-C and
testmodules-install-check-en_US.UTF-8, but those have "install" in the
name and are punctuated differently, so it's not instantly clear that
it's the same thing. Anyway, I do see it in there now, but what I'm
saying is that if there have been other failures that are related to
this, it's possible I have missed them due to stuff like this, so it's
helpful that you (Tom) pointed this one out.)

Tom, would welcome your thoughts, if you have any, and anyone else's
thoughts as well. If none, I'll proceed as described above and update
when I know more.

Thanks,

--=20
Robert Haas
EDB: http://www.enterprisedb.com