Message-ID: <d079f515-db2c-43f3-a11f-ceff7b1616c9@vondra.me>
Date: Mon, 2 Feb 2026 16:57:26 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Is there value in having optimizer stats for joins/foreignkeys?
To: Andrei Lepikhov <lepihov@gmail.com>,
 Alexandra Wang <alexandra.wang.oss@gmail.com>,
 Corey Huinker <corey.huinker@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@lists.postgresql.org,
 hs@cybertec.at, Jeff Davis <pgsql@j-davis.com>
References: <CADkLM=cUwMftPLFq0iD6-qKRyNiRM2HZGYVp6=0noxA8GfuEtA@mail.gmail.com>
 <ea04ba03-50dd-4de8-a2bd-111f246a2f30@vondra.me>
 <246035.1764627115@sss.pgh.pa.us>
 <CADkLM=fEi_GeeS3zyg6B5WgswyPe0wNXHfKQOxjy8A5fXHD7=A@mail.gmail.com>
 <CAK98qZ2mW=geT9NKe5vC68-sB9EJe_887uV=MCFt6y9AhyTp7A@mail.gmail.com>
 <CAK98qZ0LwJbUoiZjjFXitojHy4UskkjYDiSd_JZfGE9LbfZm9w@mail.gmail.com>
 <8df3d212-5d60-4e30-9606-d8849f7d37ae@gmail.com>
 <e9165403-7b0e-4c01-96d0-b73512ce359d@vondra.me>
 <3c477f2f-10e4-4705-bb21-90ccbe67e9d2@gmail.com>
Content-Language: en-US
From: Tomas Vondra <tomas@vondra.me>
In-Reply-To: <3c477f2f-10e4-4705-bb21-90ccbe67e9d2@gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Archived-At: <https://www.postgresql.org/message-id/d079f515-db2c-43f3-a11f-ceff7b1616c9%40vondra.me>
Precedence: bulk


On 2/2/26 10:53, Andrei Lepikhov wrote:
> On 1/2/26 17:39, Tomas Vondra wrote:
>> We can't simply store an opaque VIEW, and build the stats by simply
>> executing it (and sampling the results). The whole premise of extended
>> stats is that people define them to fix incorrect estimates. And with
>> incorrect estimates the plan may be terrible, and the VIEW may not even
>> complete.
> 
> Ok, I got the point.
> I think linking to a join or foreign key seems restrictive. In my mind,
> extended statistics may go the following way:
> 

I agree we don't need to restrict to joins on foreign keys. I assume the
PoC patch requires f-keys because it makes building the join sample much
simpler / easier to think about.

The paper "sampling done right" paper I mentioned explains how to sample
general joins, as long as there are appropriate indexes. Foreign keys
always have those, but the constraint itself is not needed.

FWIW I think it's perfectly acceptable to allow extended stats only on
joins with appropriate indexes. Because without that we can't do the
sampling efficiently (or possibly at all).

But I'd leave this for later. For now it's perfectly fine to limit the
scope to FK joins, and then maybe expand the scope once we figure out
the other pieces.

> CREATE STATISTICS abc_stat ON (t1.x,t2.y,t3.z) FROM t1,t2,t3;
> 
> Suppose t1.x,t2.y, and t3.z have a common equality operator.
> 
> Here we can build statistics on (t1.x = t2.y), (t1.x = t3.z), (t2.y =
> t3.z), and potentially (t1.x = t2.y = t3.z).
> 

If I understand correctly you suggest we generate all "possible" joins
joining on the columns specified in the ON clause.

I think we shouldn't do that, as it's confused about the purpose of the
ON clause. That's meant to specify the list of columns on which to build
the extended statistic, but now it would be generating join clauses.

It has to be possible to have non-join attributes in the ON clause,
because that's how we can track correlation between the tables. Which is
the whole point, I think. It's not about the join clause selectivity, or
at least not just about it.

Moreover, wouldn't it be rather inefficient? Imagine you have a join
with two tables and two join clauses. (t1.a = t2.a) AND (t1.b = t2.b).
But with your syntax it'd be just

  CREATE STATISTICS s ON (t1.a, t1.b, t2.a, t2.b) FROM t1, t2;

And we'd have to build stats for (at least) 2 joins, because we have no
idea if t1.a joins to t2.a or t2.b.

So -1 to this, IMHO we need the "full" syntax with

 CREATE STATISTICS s ON (t1.c, t2.d)
                   FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b);

We may need some additional statistics to track the selectivity of the
join clauses, in addition to the existing MCV stats built on the join
result.

> But I don't frequently detect problems with JOIN estimation using a
> single join clause. Usually, we have problems with (I) join trees
> (clauses spread across joins) and (II) a single multi-clause join.
> We can't solve (I) here (kinda statistics on a VIEW might help, I
> think), but may ease (II) using:
> 
> CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2;
> 
> or even more bravely:
> 
> CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2
> WHERE (t1.z <> t2.z);
> 

I honestly don't see why this would be better / simpler than the CREATE
STATISTICS grammar that simply allows joins in the FROM part.

regards

-- 
Tomas Vondra