Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vmwJ1-000OM0-2u for pgsql-hackers@arkaria.postgresql.org; Mon, 02 Feb 2026 15:57:39 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vmwJ0-000FTT-35 for pgsql-hackers@arkaria.postgresql.org; Mon, 02 Feb 2026 15:57:38 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vmwJ0-000FTJ-1a for pgsql-hackers@lists.postgresql.org; Mon, 02 Feb 2026 15:57:38 +0000 Received: from relay8-d.mail.gandi.net ([217.70.183.201]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vmwIx-00000000Cun-2SRa for pgsql-hackers@lists.postgresql.org; Mon, 02 Feb 2026 15:57:37 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 46C0444352; Mon, 2 Feb 2026 15:57:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1770047852; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ILL5pg6zkUw8MvOs68OXP6vpxlNAyn6rdj3U9US6agI=; b=R7PbjwY3XNG8npKJNjDMs2Rfj1jfKNVSuOR8Z9y48/V4AjG4cipfJP60/R2q6rQ4rUuaAz MwdO4SZrKQRbKlai0F87lXLi6Bf7HCXuX1kCA7RwtIsulD9WydWXsROa1eNHEhGAFdlQ1b +ZjnEYO6KEB2Bw54dlJ9/blChOom+53wc6D3GNeaU3hz350E+WSoZI1KptPKSHyrNfhoN9 ETKTbxI4379WhykwGtMTNqe0DgqUxTXjYPeG2l8bX0DP9/pMyd1OLmPV63q/NPk5uxdiG3 n1QH+UL2i9ZWGXbnf4vUgFp0HTI9g5WZMMH0AEr5gYGLOWTt03BcO5xWT+aEZw== Message-ID: Date: Mon, 2 Feb 2026 16:57:26 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Is there value in having optimizer stats for joins/foreignkeys? To: Andrei Lepikhov , Alexandra Wang , Corey Huinker Cc: Tom Lane , pgsql-hackers@lists.postgresql.org, hs@cybertec.at, Jeff Davis References: <246035.1764627115@sss.pgh.pa.us> <8df3d212-5d60-4e30-9606-d8849f7d37ae@gmail.com> <3c477f2f-10e4-4705-bb21-90ccbe67e9d2@gmail.com> Content-Language: en-US From: Tomas Vondra In-Reply-To: <3c477f2f-10e4-4705-bb21-90ccbe67e9d2@gmail.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-GND-Sasl: tomas@vondra.me X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddujeektdeiucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefkffggfgfuvfevfhfhjggtgfesthejredttddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhepledugeeikefglefhgfffuedvleetteevgefhvdeikeefudduuddvhfevudefhfevnecukfhppeekiedrgeelrddvfedtrddvtdeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepkeeirdegledrvdeftddrvddtiedphhgvlhhopegluddtrddufeejrddtrddvngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhqihgupeegieevtdeggeegfeehvddpmhhouggvpehsmhhtphhouhhtpdhnsggprhgtphhtthhopeejpdhrtghpthhtoheplhgvphhihhhovhesghhmrghilhdrtghomhdprhgtphhtthhopegrlhgvgigrnhgurhgrfigrnhhgohhsshesghhmrghilhdrtghomhdprhgtphhtthhopegtohhrvgihhhhuihhnkhgvrhesghhmrghilhdrtghomhdprhgtphhtthhopehtghhlsehsshhsrdhpghhhrdhprgdruhhspdhrtghpthhtohepphhgshhqlhdqhhgrtghkvghrsheslhhishhts hdrphhoshhtghhrvghsqhhlrdhorhhgpdhrtghpthhtohephhhssegthigsvghrthgvtgdrrght X-GND-State: clean X-GND-Score: -100 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 2/2/26 10:53, Andrei Lepikhov wrote: > On 1/2/26 17:39, Tomas Vondra wrote: >> We can't simply store an opaque VIEW, and build the stats by simply >> executing it (and sampling the results). The whole premise of extended >> stats is that people define them to fix incorrect estimates. And with >> incorrect estimates the plan may be terrible, and the VIEW may not even >> complete. > > Ok, I got the point. > I think linking to a join or foreign key seems restrictive. In my mind, > extended statistics may go the following way: > I agree we don't need to restrict to joins on foreign keys. I assume the PoC patch requires f-keys because it makes building the join sample much simpler / easier to think about. The paper "sampling done right" paper I mentioned explains how to sample general joins, as long as there are appropriate indexes. Foreign keys always have those, but the constraint itself is not needed. FWIW I think it's perfectly acceptable to allow extended stats only on joins with appropriate indexes. Because without that we can't do the sampling efficiently (or possibly at all). But I'd leave this for later. For now it's perfectly fine to limit the scope to FK joins, and then maybe expand the scope once we figure out the other pieces. > CREATE STATISTICS abc_stat ON (t1.x,t2.y,t3.z) FROM t1,t2,t3; > > Suppose t1.x,t2.y, and t3.z have a common equality operator. > > Here we can build statistics on (t1.x = t2.y), (t1.x = t3.z), (t2.y = > t3.z), and potentially (t1.x = t2.y = t3.z). > If I understand correctly you suggest we generate all "possible" joins joining on the columns specified in the ON clause. I think we shouldn't do that, as it's confused about the purpose of the ON clause. That's meant to specify the list of columns on which to build the extended statistic, but now it would be generating join clauses. It has to be possible to have non-join attributes in the ON clause, because that's how we can track correlation between the tables. Which is the whole point, I think. It's not about the join clause selectivity, or at least not just about it. Moreover, wouldn't it be rather inefficient? Imagine you have a join with two tables and two join clauses. (t1.a = t2.a) AND (t1.b = t2.b). But with your syntax it'd be just CREATE STATISTICS s ON (t1.a, t1.b, t2.a, t2.b) FROM t1, t2; And we'd have to build stats for (at least) 2 joins, because we have no idea if t1.a joins to t2.a or t2.b. So -1 to this, IMHO we need the "full" syntax with CREATE STATISTICS s ON (t1.c, t2.d) FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b); We may need some additional statistics to track the selectivity of the join clauses, in addition to the existing MCV stats built on the join result. > But I don't frequently detect problems with JOIN estimation using a > single join clause. Usually, we have problems with (I) join trees > (clauses spread across joins) and (II) a single multi-clause join. > We can't solve (I) here (kinda statistics on a VIEW might help, I > think), but may ease (II) using: > > CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2; > > or even more bravely: > > CREATE STATISTICS abc_stat ON ((t1.x=t2.x),(t1.y=t2.y)) FROM t1,t2 > WHERE (t1.z <> t2.z); > I honestly don't see why this would be better / simpler than the CREATE STATISTICS grammar that simply allows joins in the FROM part. regards -- Tomas Vondra