Re: Is there value in having optimizer stats for joins/foreignkeys?

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Is there value in having optimizer stats for joins/foreignkeys?
6+ messages / 4 participants
[nested] [flat]

* Re: Is there value in having optimizer stats for joins/foreignkeys?
@ 2026-02-02 09:53  Andrei Lepikhov <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Andrei Lepikhov @ 2026-02-02 09:53 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; Alexandra Wang <[email protected]>; Corey Huinker <[email protected]>; +Cc: Tom Lane <[email protected]>; [email protected]; [email protected]; Jeff Davis <[email protected]>

On 1/2/26 17:39, Tomas Vondra wrote:
> We can't simply store an opaque VIEW, and build the stats by simply
> executing it (and sampling the results). The whole premise of extended
> stats is that people define them to fix incorrect estimates. And with
> incorrect estimates the plan may be terrible, and the VIEW may not even
> complete.

Ok, I got the point.
I think linking to a join or foreign key seems restrictive. In my mind, 
extended statistics may go the following way:

CREATE STATISTICS abc_stat ON (t1.x,t2.y,t3.z) FROM t1,t2,t3;

Suppose t1.x,t2.y, and t3.z have a common equality operator.

Here we can build statistics on (t1.x = t2.y), (t1.x = t3.z), (t2.y = 
t3.z), and potentially (t1.x = t2.y = t3.z).

But I don't frequently detect problems with JOIN estimation using a 
single join clause. Usually, we have problems with (I) join trees 
(clauses spread across joins) and (II) a single multi-clause join.
We can't solve (I) here (kinda statistics on a VIEW might help, I 
think), but may ease (II) using:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2;

or even more bravely:

CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2
WHERE (t1.z <> t2.z);

-- 
regards, Andrei Lepikhov,
pgEdge

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Is there value in having optimizer stats for joins/foreignkeys?
@ 2026-02-02 15:57  Tomas Vondra <[email protected]>
  parent: Andrei Lepikhov <[email protected]>
  0 siblings, 2 replies; 6+ messages in thread

From: Tomas Vondra @ 2026-02-02 15:57 UTC (permalink / raw)
  To: Andrei Lepikhov <[email protected]>; Alexandra Wang <[email protected]>; Corey Huinker <[email protected]>; +Cc: Tom Lane <[email protected]>; [email protected]; [email protected]; Jeff Davis <[email protected]>

On 2/2/26 10:53, Andrei Lepikhov wrote:
> On 1/2/26 17:39, Tomas Vondra wrote:
>> We can't simply store an opaque VIEW, and build the stats by simply
>> executing it (and sampling the results). The whole premise of extended
>> stats is that people define them to fix incorrect estimates. And with
>> incorrect estimates the plan may be terrible, and the VIEW may not even
>> complete.
> 
> Ok, I got the point.
> I think linking to a join or foreign key seems restrictive. In my mind,
> extended statistics may go the following way:
> 

I agree we don't need to restrict to joins on foreign keys. I assume the
PoC patch requires f-keys because it makes building the join sample much
simpler / easier to think about.

The paper "sampling done right" paper I mentioned explains how to sample
general joins, as long as there are appropriate indexes. Foreign keys
always have those, but the constraint itself is not needed.

FWIW I think it's perfectly acceptable to allow extended stats only on
joins with appropriate indexes. Because without that we can't do the
sampling efficiently (or possibly at all).

But I'd leave this for later. For now it's perfectly fine to limit the
scope to FK joins, and then maybe expand the scope once we figure out
the other pieces.

> CREATE STATISTICS abc_stat ON (t1.x,t2.y,t3.z) FROM t1,t2,t3;
> 
> Suppose t1.x,t2.y, and t3.z have a common equality operator.
> 
> Here we can build statistics on (t1.x = t2.y), (t1.x = t3.z), (t2.y =
> t3.z), and potentially (t1.x = t2.y = t3.z).
> 

If I understand correctly you suggest we generate all "possible" joins
joining on the columns specified in the ON clause.

I think we shouldn't do that, as it's confused about the purpose of the
ON clause. That's meant to specify the list of columns on which to build
the extended statistic, but now it would be generating join clauses.

It has to be possible to have non-join attributes in the ON clause,
because that's how we can track correlation between the tables. Which is
the whole point, I think. It's not about the join clause selectivity, or
at least not just about it.

Moreover, wouldn't it be rather inefficient? Imagine you have a join
with two tables and two join clauses. (t1.a = t2.a) AND (t1.b = t2.b).
But with your syntax it'd be just

  CREATE STATISTICS s ON (t1.a, t1.b, t2.a, t2.b) FROM t1, t2;

And we'd have to build stats for (at least) 2 joins, because we have no
idea if t1.a joins to t2.a or t2.b.

So -1 to this, IMHO we need the "full" syntax with

 CREATE STATISTICS s ON (t1.c, t2.d)
                   FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b);

We may need some additional statistics to track the selectivity of the
join clauses, in addition to the existing MCV stats built on the join
result.

> But I don't frequently detect problems with JOIN estimation using a
> single join clause. Usually, we have problems with (I) join trees
> (clauses spread across joins) and (II) a single multi-clause join.
> We can't solve (I) here (kinda statistics on a VIEW might help, I
> think), but may ease (II) using:
> 
> CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2;
> 
> or even more bravely:
> 
> CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2
> WHERE (t1.z <> t2.z);
> 

I honestly don't see why this would be better / simpler than the CREATE
STATISTICS grammar that simply allows joins in the FROM part.

regards

-- 
Tomas Vondra

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Is there value in having optimizer stats for joins/foreignkeys?
@ 2026-05-21 20:25  Tom Lane <[email protected]>
  parent: Tomas Vondra <[email protected]>
  1 sibling, 1 reply; 6+ messages in thread

From: Tom Lane @ 2026-05-21 20:25 UTC (permalink / raw)
  To: Alexandra Wang <[email protected]>; +Cc: jian he <[email protected]>; [email protected]; Tomas Vondra <[email protected]>; Andrei Lepikhov <[email protected]>; Corey Huinker <[email protected]>; [email protected]; Jeff Davis <[email protected]>

Alexandra Wang <[email protected]> writes:
> Here's v7, another attempt to fix the unstable tests.

Hi Alexandra,

I signed up for an in-person review of this at PGConf.dev, but
the schedule doesn't seem to be working in favor of making that
happen.  If you see this and happen to run into me in the
hallway, I'm happy to chat, but in any case here are my
rather-hasty review notes.

I think it's okay if v1 only handles 2-way joins, as long as the
catalog representation is prepared for more.  Restricting to
cases where we can do index-based sampling seems fine too.
Those things could be relaxed later if it seems worthwhile,
but we'd have a creditable feature even without.

I didn't read the sampling code in any detail.  I think you will
need to put more thought into what is user-friendly behavior
in case the required index doesn't exist or doesn't have the
right properties.  (I think the tests for that might not be
strong enough, either.)

I think you could simplify some code noticeably if you included the
anchor rel's OID as the first element of stxjoinrels[].  Yeah,
it'd be redundant with stxrelid, but so what?  It's not like 
pg_statistic_ext rows are narrow enough that anyone would notice
the extra 4 bytes.  I think this would simplify some of the
relationships within the data structures, too, eg all varnos in
the expressions could be considered to reference stxjoinrels[].

I don't love stxkeyrefs[].  I wonder if it's time to throw away
stxkeys[], represent all the target columns as regular expression
trees in stxexprs, and then special-case columns that are simple
Vars where appropriate at execution.

(In the same vein, I dislike the grammar's separation of plain
columns from expressions; I'd like to replace stats_params
with expr_list and sort it all out later.  But perhaps that's
material for a separate patch.)

We will need to put more thought into permissions: I don't think
requiring all the tables to have the same owner is workable.
(What happens if someone tries to ALTER OWNER later?)  However,
if they don't all have the same owner, there are potential security
problems, so the right restriction is not obvious.  This is not
necessary to solve now; there are bigger questions to worry about.
But we'll need an answer before it's committable.

It's not too soon to write some user-facing documentation.
CREATE STATISTICS man page obviously needs attention, but
also the discussion of extended stats in perform.sgml.
And catalogs.sgml.  I find that writing that sort of stuff
helps to clarify where one's design is weak.

			regards, tom lane

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Is there value in having optimizer stats for joins/foreignkeys?
@ 2026-05-25 12:15  Chengpeng Yan <[email protected]>
  parent: Tomas Vondra <[email protected]>
  1 sibling, 0 replies; 6+ messages in thread

From: Chengpeng Yan @ 2026-05-25 12:15 UTC (permalink / raw)
  To: Alexandra Wang <[email protected]>; +Cc: jian he <[email protected]>; [email protected] <[email protected]>; Tomas Vondra <[email protected]>; Andrei Lepikhov <[email protected]>; Corey Huinker <[email protected]>; Tom Lane <[email protected]>; [email protected] <[email protected]>; Jeff Davis <[email protected]>

Hi,

> On May 15, 2026, at 23:30, Alexandra Wang <[email protected]> wrote:
> 
> Here's v7, another attempt to fix the unstable tests.

Thanks for working on this. I have a few design comments about the lifecycle
and the intended scope of the join statistic.

First, the index dependency contract seems worth clarifying. CREATE STATISTICS
records a normal dependency on the selected index, so DROP INDEX is blocked
unless CASCADE is used, although that index is not part of the statistics
definition. But ANALYZE appears to re-discover a suitable index when refreshing
the statistic. Is the index intended to be part of the statistic's persistent
contract, or only a creation-time proof that index-based sampling is possible?
If the latter, should DROP INDEX still be blocked when another equivalent index
exists?

Second, this seems related to the earlier concern that ANALYZE is per-table.
The statistic is owned by the anchor relation, but its contents depend on the
probed relation too. In the current patch, ANALYZE on the probed relation can
refresh its own statistics without refreshing the join statistic. If the
probed relation has changed substantially, that leaves a possible staleness
gap where the planner combines fresh base-table statistics with stale
cross-relation skew information.

Third, the contract for non-unique indexes on the probed side seems worth
clarifying. The comments define raw_sel as anchor-relative:
P(join AND covered_filters) / anchor_totalrows, roughly Jf / anchor_totalrows,
where Jf is the number of joined rows satisfying the covered filters. But the
implementation computes raw_sel from MCV frequencies. Since the MCV list is
built from sampled joined rows, a plain MCV frequency is naturally measured
inside that joined sample, roughly Jf / J where J is the sampled join-result
size. It would be useful to clarify when those two quantities are expected to
be equivalent, especially when a non-unique probed-side index allows one
anchor row to contribute multiple joined rows.

These two measures are close in FK-like cases where the joined sample size
tracks the anchor sample size. With a non-unique lookup, one anchor row may
appear many times in the joined sample. For example, if one key matches many
red rows and another key matches only one blue row, red may dominate the
joined sample because of match multiplicity. That frequency describes the
distribution within the joined result, but not how many matching joined rows
are produced per anchor row. It would be useful to state whether such
one-to-many joins are outside the current supported scope, or how the
MCV-derived raw_sel accounts for how many joined rows each anchor row
contributed before it is converted into planner join selectivity.

--
Best regards,
Chengpeng Yan

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Is there value in having optimizer stats for joins/foreignkeys?
@ 2026-05-25 14:34  Tomas Vondra <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Tomas Vondra @ 2026-05-25 14:34 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; Alexandra Wang <[email protected]>; +Cc: jian he <[email protected]>; [email protected]; Andrei Lepikhov <[email protected]>; Corey Huinker <[email protected]>; [email protected]; Jeff Davis <[email protected]>

On 5/21/26 22:25, Tom Lane wrote:
> Alexandra Wang <[email protected]> writes:
>> Here's v7, another attempt to fix the unstable tests.
> 
> Hi Alexandra,
> 
> I signed up for an in-person review of this at PGConf.dev, but
> the schedule doesn't seem to be working in favor of making that
> happen.  If you see this and happen to run into me in the
> hallway, I'm happy to chat, but in any case here are my
> rather-hasty review notes.
> 
> I think it's okay if v1 only handles 2-way joins, as long as the
> catalog representation is prepared for more.  Restricting to
> cases where we can do index-based sampling seems fine too.
> Those things could be relaxed later if it seems worthwhile,
> but we'd have a creditable feature even without.
> 

+1

My assumption is allowing larger joins would be a somewhat mechanical.
I'm not aware of additional problems on top of 2-way joins.

I think we should aim to support larger joins. At the unconference there
were suggestions maybe it would be enough to support 2-way joins, but
it's not hard to construct cases where that's no sufficient. Consider a
fact table, joining to two dimensions. It's common for the dimensions to
be correlated in various ways, and 2-way joins can't handle these cases.

> I didn't read the sampling code in any detail.  I think you will
> need to put more thought into what is user-friendly behavior
> in case the required index doesn't exist or doesn't have the
> right properties.  (I think the tests for that might not be
> strong enough, either.)
> 

What would be the most "user-friendly behavior" in this case?

I think we can either (a) refuse defining/building the join statistics
in this case, or (b) fallback to sampling not requiring an index (but
then it'll be way more expensive).

I think (a) should be fine for now, i.e. we should require an index.
Most of the joins will be on FK constraints, or something like that (the
FK may not be defined, but there will be a PK on one side).

Not sure if we should simply refuse building the stats, or if it's
enough to detect this while building the statistics. I'd say enforcing
this during DDL (CREATE STATISTICS, DROP INDEX, ...) is better,
otherwise users may not notice the statistics stopped building.

> I think you could simplify some code noticeably if you included the
> anchor rel's OID as the first element of stxjoinrels[].  Yeah,
> it'd be redundant with stxrelid, but so what?  It's not like 
> pg_statistic_ext rows are narrow enough that anyone would notice
> the extra 4 bytes.  I think this would simplify some of the
> relationships within the data structures, too, eg all varnos in
> the expressions could be considered to reference stxjoinrels[].
> 
> I don't love stxkeyrefs[].  I wonder if it's time to throw away
> stxkeys[], represent all the target columns as regular expression
> trees in stxexprs, and then special-case columns that are simple
> Vars where appropriate at execution.
> 

+1, I don't see a reason to not store the anchor rel separately.

> (In the same vein, I dislike the grammar's separation of plain
> columns from expressions; I'd like to replace stats_params
> with expr_list and sort it all out later.  But perhaps that's
> material for a separate patch.)
> 

FWIW the extended stats copied this from pg_index, which also stores
keys and expressions separately. I suppose there was a reason for that,
most likely performance - is cheaper to compare attnums than
expressions, and plain keys are much more common.

Maybe that's no longer true, or maybe it's not as important for extended
stats (there's likely fewer of those, compared to indexes).

> We will need to put more thought into permissions: I don't think
> requiring all the tables to have the same owner is workable.
> (What happens if someone tries to ALTER OWNER later?)  However,
> if they don't all have the same owner, there are potential security
> problems, so the right restriction is not obvious.  This is not
> necessary to solve now; there are bigger questions to worry about.
> But we'll need an answer before it's committable.
> 

I have not thought about this at all, but what can we do if the tables
have different owners? I suppose we could require the stxowner to have
SELECT privilege on the joined relations (instead of owning them).

> It's not too soon to write some user-facing documentation.
> CREATE STATISTICS man page obviously needs attention, but
> also the discussion of extended stats in perform.sgml.
> And catalogs.sgml.  I find that writing that sort of stuff
> helps to clarify where one's design is weak.
> 

+1

regards

-- 
Tomas Vondra

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Is there value in having optimizer stats for joins/foreignkeys?
@ 2026-05-25 15:03  Tom Lane <[email protected]>
  parent: Tomas Vondra <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: Tom Lane @ 2026-05-25 15:03 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Alexandra Wang <[email protected]>; jian he <[email protected]>; [email protected]; Andrei Lepikhov <[email protected]>; Corey Huinker <[email protected]>; [email protected]; Jeff Davis <[email protected]>

Tomas Vondra <[email protected]> writes:
> On 5/21/26 22:25, Tom Lane wrote:
>> I don't love stxkeyrefs[].  I wonder if it's time to throw away
>> stxkeys[], represent all the target columns as regular expression
>> trees in stxexprs, and then special-case columns that are simple
>> Vars where appropriate at execution.
>> (In the same vein, I dislike the grammar's separation of plain
>> columns from expressions; I'd like to replace stats_params
>> with expr_list and sort it all out later.  But perhaps that's
>> material for a separate patch.)

> FWIW the extended stats copied this from pg_index, which also stores
> keys and expressions separately. I suppose there was a reason for that,
> most likely performance - is cheaper to compare attnums than
> expressions, and plain keys are much more common.

I think I might be to blame for the separate storage of indexprs.
If so, the motivation was to avoid breakage of older code that only
knew about indkey[].  (Of course, such code would necessarily fail
on indexes with expressions, but we wanted to avoid breakage for the
common case of no-expressions.)  I don't think that consideration is
nearly as pressing for extended stats.  There's probably a lot less
client-side code that knows about extended stats at all, and what
there is seems more likely to rely on the server-side display
functions than to dig into the catalog details for itself.  Also,
if there is anything that's looking at pg_statistic_ext details,
it will need work anyway after this patch; there's no way around that.

>> We will need to put more thought into permissions: I don't think
>> requiring all the tables to have the same owner is workable.

> I have not thought about this at all, but what can we do if the tables
> have different owners? I suppose we could require the stxowner to have
> SELECT privilege on the joined relations (instead of owning them).

Yeah, the rough idea I had was to require ownership on the anchor
table and SELECT on the rest.  But it's not terribly clear what
to do if that SELECT privilege gets revoked.

I also wonder to what extent we have a problem with users of the
anchor table being able to infer something about the contents of the
other tables via plan choices, and whether it matters if they can.
They may well be able to make the same inferences anyway from query
results.  For that matter, if a user is able to issue a query for
which a set of extended join stats is relevant, it seems likely that
that must mean she has SELECT on the other tables anyway.  But
maybe I'm missing some case where that wouldn't be true.

			regards, tom lane

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2026-05-25 15:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-02 09:53 Re: Is there value in having optimizer stats for joins/foreignkeys? Andrei Lepikhov <[email protected]>
2026-02-02 15:57 ` Tomas Vondra <[email protected]>
2026-05-21 20:25   ` Tom Lane <[email protected]>
2026-05-25 14:34     ` Tomas Vondra <[email protected]>
2026-05-25 15:03       ` Tom Lane <[email protected]>
2026-05-25 12:15   ` Chengpeng Yan <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox