Re: Is there value in having optimizer stats for joins/foreignkeys?

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Tomas Vondra <[email protected]>
To: Corey Huinker <[email protected]>
Cc: Alexandra Wang <[email protected]>
Cc: Tom Lane <[email protected]>
Cc: jian he <[email protected]>
Cc: [email protected]
Cc: Andrei Lepikhov <[email protected]>
Cc: [email protected]
Cc: Jeff Davis <[email protected]>
Subject: Re: Is there value in having optimizer stats for joins/foreignkeys?
Date: Thu, 28 May 2026 03:29:15 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <CADkLM=dabcDYS1hqbZbeVsqeTH_AyD0tPvsvc77HOk=mMsVhOg@mail.gmail.com>
References: <CADkLM=cUwMftPLFq0iD6-qKRyNiRM2HZGYVp6=0noxA8GfuEtA@mail.gmail.com>
	<[email protected]>
	<CAK98qZ2ySno00SApwj3X_MgN4iuBzKaXKXOY+U3jaFqTxPS8Tg@mail.gmail.com>
	<CAK98qZ0Vr4D3usesj_qpNFAH1=7NKX=bR9bNRNM7tdnk5KWHDw@mail.gmail.com>
	<CAK98qZ2TveWY34VBw8LmQUEq2GUrPLhsaYYORgDAA5VZybyS5w@mail.gmail.com>
	<CAK98qZ3TGXKcudcnyEhGuMUiru36m=Vm=Zm5Gyug0h+VyxmD0w@mail.gmail.com>
	<CACJufxFOe4rx=J+0+=_g+K=bnhxs1OUdMJCgQ+tV0KdkQ_2aeQ@mail.gmail.com>
	<CAK98qZ2mMJ3Nvx9U1S9N9Put1bb=iOgy8vFdp=rfjfta4wJ2AQ@mail.gmail.com>
	<CAK98qZ3ASXEBuftvd=SxPwLOXgo=2-88SyZyLxXf3k4HfDKKsw@mail.gmail.com>
	<CAK98qZ397f5xEkNgqEoeZTbXBnr7h665wZMvGJOnok+ch+XWSA@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAK98qZ0_4Kodyemk0Tdmew=YG8jeHQ=wzOydQws4s99A1X2h5g@mail.gmail.com>
	<[email protected]>
	<CADkLM=dabcDYS1hqbZbeVsqeTH_AyD0tPvsvc77HOk=mMsVhOg@mail.gmail.com>

On 5/28/26 00:02, Corey Huinker wrote:
>     Also, maybe this is one more argument against the "optimizer view" idea
>     Corey mentioned in Vancouver last week? Because surely we'll want to
>     include the join statistics in export/import, and for the view approach
>     we'd need to invent a fair amount of code to do that. Maybe?
> 
> 
> We definitely want join statistics in the export/import. 
> 
> To my mind, having stats on certain views would be extremely simple for
> export/import, we'd simply have one more relkind that makes it into the
> system views pg_stats and pg_stats_ext, and the statistics for that new
> relation would plug into pg_statistic with no catalog change to
> pg_statistic whatsoever. Additionally, allowing certain kinds of views
> (or a relation relkind that's functionally equivalent to a view) to have
> statistics makes it easy to define extended statistics on those views,
> with no catalog change to pg_statistic_ext.
> 

Ah, so you're proposing supporting CREATE STATISTICS on some views? I
guess that's one way to support stats on joins, without having to rework
the schema. I'm still not a huge fan of it, because it just uses views
as a workaround to store the join definition, nothing else. To me it
seems a weird to require creating a new relation just for this, and we
already envisioned CREATE STATISTICS would cover joins. My guess is that
may be why DB2 did it this way, as they probably didn't have anything
like extended stats at that point.

> Having something in pg_class to anchor existing per-attribute and multi-
> attribute stats off of seems like a big win to me. We'd get all pre-
> existing statistics types for free, statistics kinds provided by
> extensions for free, extended stats for free, import and export for
> free. Well, not free, we just have to remove the exclusion that views
> (or whatever relkind we create) can't have stats.
> 

TBH the grammar / catalog stuff seems like a relatively minor part of
this patch. To me, the difficult part seems to be the sampling / analyze
part, and then matching it to the query during planning. And all of this
seems exactly the same no matter how the stats are defined.

OTOH maybe allowing CREATE STATISTICS on views would be independently
useful too, once we have the later parts. Not sure.

> Having said all that, the statistics import code reads from
> pg_statistic_ext but obviously never modifies it, it's all about
> constructing the pg_statistic_ext_data row, and it need to only concern
> itself with importing to the current version, so internal catalog
> changes to pg_statistic_ext wouldn't be that big of a deal.
> 
> If, however, we want to stick with the notion that all non-relation,
> not-attribute stats are extended stats, then we have to remodel a bit:
> 
> - as Tom and Tomas mentioned, we'd probably want to dispense with the
> negative attnums and stxexprs, as those are already a bit of a pain to
> deal with. That might be worth of a patch even without join statistics.
> 
> - we'd want a way to express stats for all the individual columns/
> expressions defined in the join stats object, i.e. "what is the MCV of
> B.name for rows joined by A on A.b_id"?
> 

Maybe I misunderstand, but don't we already do this for statistics on
expressions?

> - we'd want some way of excluding the non-interesting combinations of
> columns. If we had a stat on A.qty, A.price, B.cust_name, C.sale_date,
> D.item_name, then we'd have 5-factorial MCV combinations in addition to
> the 5 single-column stats, and the (A.qty, A.price) combinations can
> already be covered by a non-join extended stat.
> 

I don't follow. What 5-factorial combinations? We only ever build a
single MCV for a given statistics object.

> It's those things that make optimizer views so attractive to me - we
> already have a way to store individual column stats (pg_statistic) and
> shut them off when not interesting (pg_attribute.attstattarget), and
> extended statistics are a great way to describe which combinations of
> columns are interesting to us.
> 

I may be missing something, but I honestly the only benefit of views I
can think of is already having the join definition in a catalog.

> Typing all this has me thinking that there may be a third way:
> 
> - statistics objects become pg_class objects, stxkeys and stxexprs 
> become pg_attribute rows
> - pg_statisic_ext keeps (stxrelid, stxstattarget, stxkind), adds stxid
> (pointing to new pg_class object)  sort of like pg_index
> - pg_statistic_ext_data remains as-is
> 
> and we'd set per-column stats collection like ALTER STATISTICS foo ALTER
> COLUMN.

I don't follow. Why should statistics object be pg_class objects? In my
mind pg_class is meant for "relations" (as in, table-like things), and
statistics objects are not like that. I suppose this is related to your
earlier suggestion

  Having something in pg_class to anchor existing per-attribute and
  multi-attribute stats off of seems like a big win to me.

but it's not clear to me why would that be? All statistics are tied to a
relation (or multiple relations) in the end, and I don't see why would
the view make anything simpler. What are the wins?

Could you elaborate? It's entirely possible I just don't see something
obvious, or maybe you explained this in Vancouver and I managed to
forget the details. Sorry about that.


regards

-- 
Tomas Vondra

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Is there value in having optimizer stats for joins/foreignkeys?
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox