public inbox for [email protected]  
help / color / mirror / Atom feed
Re: generic plans and "initial" pruning
4+ messages / 3 participants
[nested] [flat]

* Re: generic plans and "initial" pruning
@ 2022-01-12 14:31 Amit Langote <[email protected]>
  2022-01-12 18:20 ` Re: generic plans and "initial" pruning Robert Haas <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Amit Langote @ 2022-01-12 14:31 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: pgsql-hackers

Thanks for taking the time to look at this.

On Wed, Jan 12, 2022 at 1:22 AM Robert Haas <[email protected]> wrote:
> On Fri, Dec 24, 2021 at 10:36 PM Amit Langote <[email protected]> wrote:
> > However, using an idea that Robert suggested to me off-list a little
> > while back, it seems possible to determine the set of partitions that
> > we can safely skip locking.  The idea is to look at the "initial" or
> > "pre-execution" pruning instructions contained in a given Append or
> > MergeAppend node when AcquireExecutorLocks() is collecting the
> > relations to lock and consider relations from only those sub-nodes
> > that survive performing those instructions.   I've attempted
> > implementing that idea in the attached patch.
>
> Hmm. The first question that occurs to me is whether this is fully safe.
>
> Currently, AcquireExecutorLocks calls LockRelationOid for every
> relation involved in the query. That means we will probably lock at
> least one relation on which we previously had no lock and thus
> AcceptInvalidationMessages(). That will end up marking the query as no
> longer valid and CheckCachedPlan() will realize this and tell the
> caller to replan. In the corner case where we already hold all the
> required locks, we will not accept invalidation messages at this
> point, but must have done so after acquiring the last of the locks
> required, and if that didn't mark the plan invalid, it can't be
> invalid now either. Either way, everything is fine.
>
> With the proposed patch, we might never lock some of the relations
> involved in the query. Therefore, if one of those relations has been
> modified in some way that would invalidate the plan, we will
> potentially fail to discover this, and will use the plan anyway.  For
> instance, suppose there's one particular partition that has an extra
> index and the plan involves an Index Scan using that index. Now
> suppose that the scan of the partition in question is pruned, but
> meanwhile, the index has been dropped. Now we're running a plan that
> scans a nonexistent index. Admittedly, we're not running that part of
> the plan. But is that enough for this to be safe? There are things
> (like EXPLAIN or auto_explain) that we might try to do even on a part
> of the plan tree that we don't try to run. Those things might break,
> because for example we won't be able to look up the name of an index
> in the catalogs for EXPLAIN output if the index is gone.
>
> This is just a relatively simple example and I think there are
> probably a bunch of others. There are a lot of kinds of DDL that could
> be performed on a partition that gets pruned away: DROP INDEX is just
> one example. The point is that to my knowledge we have no existing
> case where we try to use a plan that might be only partly valid, so if
> we introduce one, there's some risk there. I thought for a while, too,
> about whether changes to some object in a part of the plan that we're
> not executing could break things for the rest of the plan even if we
> never do anything with the plan but execute it. I can't quite see any
> actual hazard. For example, I thought about whether we might try to
> get the tuple descriptor for the pruned-away object and get a
> different tuple descriptor than we were expecting. I think we can't,
> because (1) the pruned object has to be a partition, and tuple
> descriptors have to match throughout the partitioning hierarchy,
> except for column ordering, which currently can't be changed
> after-the-fact and (2) IIRC, the tuple descriptor is stored in the
> plan and not reconstructed at runtime and (3) if we don't end up
> opening the relation because it's pruned, then we certainly can't do
> anything with its tuple descriptor. But it might be worth giving more
> thought to the question of whether there's any other way we could be
> depending on the details of an object that ended up getting pruned.

I have pondered on the possible hazards before writing the patch,
mainly because the concerns about a previously discussed proposal were
along similar lines [1].

IIUC, you're saying the plan tree is subject to inspection by non-core
code before ExecutorStart() has initialized a PlanState tree, which
must have discarded pruned portions of the plan tree.  I wouldn't
claim to have scanned *all* of the core code that could possibly
access the invalidated portions of the plan tree, but from what I have
seen, I couldn't find any site that does.  An ExecutorStart_hook()
gets to do that, but from what I can see it is expected to call
standard_ExecutorStart() before doing its thing and supposedly only
looks at the PlanState tree, which must be valid.  Actually, EXPLAIN
also does ExecutorStart() before starting to look at the plan (the
PlanState tree), so must not run into pruned plan tree nodes.  All
that said, it does sound like wishful thinking to say that no problems
can possibly occur.

At first, I had tried to implement this such that the
Append/MergeAppend nodes are edited to record the result of initial
pruning, but it felt wrong to be munging the plan tree in plancache.c.

Or, maybe this won't be a concern if performing ExecutorStart() is
made a part of CheckCachedPlan() somehow, which would then take locks
on the relation as the PlanState tree is built capturing any plan
invalidations, instead of AcquireExecutorLocks(). That does sound like
an ambitious undertaking though.

> > Note that "initial" pruning steps are now performed twice when
> > executing generic plans: once in AcquireExecutorLocks() to find
> > partitions to be locked, and a 2nd time in ExecInit[Merge]Append() to
> > determine the set of partition sub-nodes to be initialized for
> > execution, though I wasn't able to come up with a good idea to avoid
> > this duplication.
>
> I think this is something that will need to be fixed somehow. Apart
> from the CPU cost, it's scary to imagine that the set of nodes on
> which we acquired locks might be different from the set of nodes that
> we initialize. If we do the same computation twice, there must be some
> non-zero probability of getting a different answer the second time,
> even if the circumstances under which it would actually happen are
> remote. Consider, for example, a function that is labeled IMMUTABLE
> but is really VOLATILE. Now maybe you can get the system to lock one
> set of partitions and then initialize a different set of partitions. I
> don't think we want to try to reason about what consequences that
> might have and prove that somehow it's going to be OK; I think we want
> to nail the door shut very tightly to make sure that it can't.

Yeah, the premise of the patch is that "initial" pruning steps produce
the same result both times.  I assumed that would be true because the
pruning steps are not allowed to contain any VOLATILE expressions.
Regarding the possibility that IMMUTABLE labeling of functions may be
incorrect, I haven't considered if the runtime pruning code can cope
or whether it should try to.  If such a case does occur in practice,
the bad outcome would be an Assert failure in
ExecGetRangeTableRelation() or using a partition unlocked in the
non-assert builds, the latter of which feels especially bad.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CA%2BTgmoZN-80143F8OhN8Cn5-uDae5miLYVwMapAuc%2B7%2BZ7pyNg%40ma...






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: generic plans and "initial" pruning
  2022-01-12 14:31 Re: generic plans and "initial" pruning Amit Langote <[email protected]>
@ 2022-01-12 18:20 ` Robert Haas <[email protected]>
  2022-02-13 21:55   ` Re: generic plans and "initial" pruning David Rowley <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Robert Haas @ 2022-01-12 18:20 UTC (permalink / raw)
  To: Amit Langote <[email protected]>; +Cc: pgsql-hackers

On Wed, Jan 12, 2022 at 9:32 AM Amit Langote <[email protected]> wrote:
> I have pondered on the possible hazards before writing the patch,
> mainly because the concerns about a previously discussed proposal were
> along similar lines [1].

True. I think that the hazards are narrower with this proposal,
because if you *delay* locking a partition that you eventually need,
then you might end up trying to actually execute a portion of the plan
that's no longer valid. That seems like hopelessly bad news. On the
other hand, with this proposal, you skip locking altogether, but only
for parts of the plan that you don't plan to execute. That's still
kind of scary, but not to nearly the same degree.

> IIUC, you're saying the plan tree is subject to inspection by non-core
> code before ExecutorStart() has initialized a PlanState tree, which
> must have discarded pruned portions of the plan tree.  I wouldn't
> claim to have scanned *all* of the core code that could possibly
> access the invalidated portions of the plan tree, but from what I have
> seen, I couldn't find any site that does.  An ExecutorStart_hook()
> gets to do that, but from what I can see it is expected to call
> standard_ExecutorStart() before doing its thing and supposedly only
> looks at the PlanState tree, which must be valid.  Actually, EXPLAIN
> also does ExecutorStart() before starting to look at the plan (the
> PlanState tree), so must not run into pruned plan tree nodes.  All
> that said, it does sound like wishful thinking to say that no problems
> can possibly occur.

Yeah. I don't think it's only non-core code we need to worry about
either. What if I just do EXPLAIN ANALYZE on a prepared query that
ends up pruning away some stuff? IIRC, the pruned subplans are not
shown, so we might escape disaster here, but FWIW if I'd committed
that code I would have pushed hard for showing those and saying "(not
executed)" .... so it's not too crazy to imagine a world in which
things work that way.

> At first, I had tried to implement this such that the
> Append/MergeAppend nodes are edited to record the result of initial
> pruning, but it felt wrong to be munging the plan tree in plancache.c.

It is. You can't munge the plan tree: it's required to be strictly
read-only once generated. It can be serialized and deserialized for
transmission to workers, and it can be shared across executions.

> Or, maybe this won't be a concern if performing ExecutorStart() is
> made a part of CheckCachedPlan() somehow, which would then take locks
> on the relation as the PlanState tree is built capturing any plan
> invalidations, instead of AcquireExecutorLocks(). That does sound like
> an ambitious undertaking though.

On the surface that would seem to involve abstraction violations, but
maybe that could be finessed somehow. The plancache shouldn't know too
much about what the executor is going to do with the plan, but it
could ask the executor to perform a step that has been designed for
use by the plancache. I guess the core problem here is how to pass
around information that is node-specific before we've stood up the
executor state tree. Maybe the executor could have a function that
does the pruning and returns some kind of array of results that can be
used both to decide what to lock and also what to consider as pruned
at the start of execution. (I'm hand-waving about the details because
I don't know.)

> Yeah, the premise of the patch is that "initial" pruning steps produce
> the same result both times.  I assumed that would be true because the
> pruning steps are not allowed to contain any VOLATILE expressions.
> Regarding the possibility that IMMUTABLE labeling of functions may be
> incorrect, I haven't considered if the runtime pruning code can cope
> or whether it should try to.  If such a case does occur in practice,
> the bad outcome would be an Assert failure in
> ExecGetRangeTableRelation() or using a partition unlocked in the
> non-assert builds, the latter of which feels especially bad.

Right. I think it's OK for a query to produce wrong answers under
those kinds of conditions - the user has broken everything and gets to
keep all the pieces - but doing stuff that might violate fundamental
assumptions of the system like "relations can only be accessed when
holding a lock on them" feels quite bad. It's not a stretch to imagine
that failing to follow those invariants could take the whole system
down, which is clearly too severe a consequence for the user's failure
to label things properly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: generic plans and "initial" pruning
  2022-01-12 14:31 Re: generic plans and "initial" pruning Amit Langote <[email protected]>
  2022-01-12 18:20 ` Re: generic plans and "initial" pruning Robert Haas <[email protected]>
@ 2022-02-13 21:55   ` David Rowley <[email protected]>
  2022-02-14 20:17     ` Re: generic plans and "initial" pruning Robert Haas <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: David Rowley @ 2022-02-13 21:55 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Amit Langote <[email protected]>; pgsql-hackers

(just catching up on this thread)

On Thu, 13 Jan 2022 at 07:20, Robert Haas <[email protected]> wrote:
> Yeah. I don't think it's only non-core code we need to worry about
> either. What if I just do EXPLAIN ANALYZE on a prepared query that
> ends up pruning away some stuff? IIRC, the pruned subplans are not
> shown, so we might escape disaster here, but FWIW if I'd committed
> that code I would have pushed hard for showing those and saying "(not
> executed)" .... so it's not too crazy to imagine a world in which
> things work that way.

FWIW, that would remove the whole point in init run-time pruning.  The
reason I made two phases of run-time pruning was so that we could get
away from having the init plan overhead of nodes we'll never need to
scan.  If we wanted to show the (never executed) scans in EXPLAIN then
we'd need to do the init plan part and allocate all that memory
needlessly.

Imagine a hash partitioned table on "id" with 1000 partitions. The user does:

PREPARE q1 (INT) AS SELECT * FROM parttab WHERE id = $1;

EXECUTE q1(123);

Assuming a generic plan, if we didn't have init pruning then we have
to build a plan containing the scans for all 1000 partitions. There's
significant overhead to that compared to just locking the partitions,
and initialising 1 scan.

If it worked this way then we'd be even further from Amit's goal of
reducing the overhead of starting plan with run-time pruning nodes.

I understood at the time it was just the EXPLAIN output that you had
concerns with. I thought that was just around the lack of any display
of the condition we used for pruning.

David






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: generic plans and "initial" pruning
  2022-01-12 14:31 Re: generic plans and "initial" pruning Amit Langote <[email protected]>
  2022-01-12 18:20 ` Re: generic plans and "initial" pruning Robert Haas <[email protected]>
  2022-02-13 21:55   ` Re: generic plans and "initial" pruning David Rowley <[email protected]>
@ 2022-02-14 20:17     ` Robert Haas <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Robert Haas @ 2022-02-14 20:17 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Amit Langote <[email protected]>; pgsql-hackers

On Sun, Feb 13, 2022 at 4:55 PM David Rowley <[email protected]> wrote:
> FWIW, that would remove the whole point in init run-time pruning.  The
> reason I made two phases of run-time pruning was so that we could get
> away from having the init plan overhead of nodes we'll never need to
> scan.  If we wanted to show the (never executed) scans in EXPLAIN then
> we'd need to do the init plan part and allocate all that memory
> needlessly.

Interesting. I didn't realize that was why it had ended up like this.

> I understood at the time it was just the EXPLAIN output that you had
> concerns with. I thought that was just around the lack of any display
> of the condition we used for pruning.

That was part of it, but I did think it was surprising that we didn't
print anything at all about the nodes we pruned, too. Although we're
technically iterating over the PlanState, from the user perspective it
feels like you're asking PostgreSQL to print out the plan - so it
seems weird to have nodes in the Plan tree that are quietly omitted
from the output. That said, perhaps in retrospect it's good that it
ended up as it did, since we'd have a lot of trouble printing anything
sensible for a scan of a table that's since been dropped.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






^ permalink  raw  reply  [nested|flat] 4+ messages in thread


end of thread, other threads:[~2022-02-14 20:17 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2022-01-12 14:31 Re: generic plans and "initial" pruning Amit Langote <[email protected]>
2022-01-12 18:20 ` Robert Haas <[email protected]>
2022-02-13 21:55   ` David Rowley <[email protected]>
2022-02-14 20:17     ` Robert Haas <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox