Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wP6Ed-000Qiv-35 for pgsql-hackers@arkaria.postgresql.org; Mon, 18 May 2026 22:14:52 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wP6Db-003ITu-1F for pgsql-hackers@arkaria.postgresql.org; Mon, 18 May 2026 22:13:48 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wP6Da-003ITk-2x for pgsql-hackers@lists.postgresql.org; Mon, 18 May 2026 22:13:48 +0000 Received: from forward501d.mail.yandex.net ([2a02:6b8:c41:1300:1:45:d181:d501]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1wP6DY-00000000GM2-2mek for pgsql-hackers@lists.postgresql.org; Mon, 18 May 2026 22:13:47 +0000 Received: from mail-nwsmtp-smtp-production-main-95.klg.yp-c.yandex.net (mail-nwsmtp-smtp-production-main-95.klg.yp-c.yandex.net [IPv6:2a02:6b8:c42:2a21:0:640:9c41:0]) by forward501d.mail.yandex.net (Yandex) with ESMTPS id 904DC81078; Tue, 19 May 2026 01:13:41 +0300 (MSK) Received: by mail-nwsmtp-smtp-production-main-95.klg.yp-c.yandex.net (smtp) with ESMTPSA id dDZhbfFRL8c0-fUc94Rzj; Tue, 19 May 2026 01:13:40 +0300 X-Yandex-Fwd: 1 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=tantorlabs.com; s=mail; t=1779142420; bh=GWKYpId92L7sgpKci+rcmAKWv7JKvGpxKZko83oEp1Q=; h=In-Reply-To:References:To:Subject:Message-ID:Date:From; b=Fo++uKtBD3mQmBtiJAaRnTKKCbdt7GjKmdAfBFKlpjCKUaxiUal/CYR2kz02ow2ja SUbKHYgoMPt7mPAqTS2BORz6u9g0JF9xcrL8HnB01e2Lt0Ujpm5cTU5q5DpN3hskvt iV3dC0DSniVz+qXP1GvQe8WowXkUqS1PYpaMGszQ= Authentication-Results: mail-nwsmtp-smtp-production-main-95.klg.yp-c.yandex.net; dkim=pass header.i=@tantorlabs.com Content-Type: multipart/alternative; boundary="------------HUM0S3z2abOrBI0EgJKRqpuY" Message-ID: <00d52d6f-ae9e-40df-80d3-9e96a23efa70@tantorlabs.com> Date: Tue, 19 May 2026 01:13:39 +0300 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Extended statistics improvement: multi-column MCV missing values To: =?UTF-8?Q?Enrique_S=C3=A1nchez?= , pgsql-hackers@lists.postgresql.org References: Content-Language: en-US From: Ilia Evdokimov In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------HUM0S3z2abOrBI0EgJKRqpuY Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Hi Enrique, On 5/18/26 19:09, Enrique Sánchez wrote: > > Postgres only uses multi-column MCVs when the value we are looking for > is in the list. If not, it falls back into individual independent > statistics to estimate selectivity. > However, a miss in a multi-column MCV list still yields valuable > information that it currently throws away: we know that the > combination's frequency is strictly bounded by the frequency of the > last (least common) item in that MCV list. LGTM. If the multicolumn MCV statistics exists and the clause combination is absent from the MCV-list, we can use the least frequent MCV item as an upper bound. BTW, this only applies to AND-clauses. > > 2. Estimate selectivity as Postgres does for single-column values not > in MCVs > ============================================================================= > While that significantly improves estimations, we could mirror what > Postgres already does for individual MCVs. Quote from the official > documentation: > > The approach is to use the fact that the value is not in the list, > combined with the knowledge of the frequencies for all of the MCVs: > > That is, add up all the frequencies for the MCVs and subtract them > from one, then divide by the number of other distinct values. > > To achieve this, we need to store an ndistinct estimation alongside > the MCVs that can be used for partial or entire column match. > > P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) - > MCV_list_size) > > > ... > > I think this is a cheap way to prevent bad estimations. The storage > overhead of adding an ndistinct field is trivial compared to the MCV > list itself, and the O(1) arithmetic during planning adds no > measurable overhead. I look forward to your feedback before drafting a > patch. For this, the ndistinct extended statistics already exist. If both MCV and ndistinct statistics are present on the same column set, the formula is correct. There are already places in the code that compute ndistinct for columns without extended ndistinct statistics (see estimate_num_groups) - but it is worth thinking carefully about whether the added complexity is justified before going down that path. -- Best regards, Ilia Evdokimov, Tantor Labs LLC, https://tantorlabs.com/ --------------HUM0S3z2abOrBI0EgJKRqpuY Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit

Hi Enrique,

On 5/18/26 19:09, Enrique Sánchez wrote:


Postgres only uses multi-column MCVs when the value we are looking for is in the list. If not, it falls back into individual independent statistics to estimate selectivity.
However, a miss in a multi-column MCV list still yields valuable information that it currently throws away: we know that the combination's frequency is strictly bounded by the frequency of the last (least common) item in that MCV list.

LGTM. If the multicolumn MCV statistics exists and the clause combination is absent from the MCV-list, we can use the least frequent MCV item as an upper bound. BTW, this only applies to AND-clauses.



2. Estimate selectivity as Postgres does for single-column values not in MCVs
=============================================================================
While that significantly improves estimations, we could mirror what Postgres already does for individual MCVs. Quote from the official documentation:
> The approach is to use the fact that the value is not in the list, combined with the knowledge of the frequencies for all of the MCVs:
> That is, add up all the frequencies for the MCVs and subtract them from one, then divide by the number of other distinct values.

To achieve this, we need to store an ndistinct estimation alongside the MCVs that can be used for partial or entire column match.

P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) - MCV_list_size)


...

I think this is a cheap way to prevent bad estimations. The storage overhead of adding an ndistinct field is trivial compared to the MCV list itself, and the O(1) arithmetic during planning adds no measurable overhead. I look forward to your feedback before drafting a patch.

For this, the ndistinct extended statistics already exist. If both MCV and ndistinct statistics are present on the same column set, the formula is correct. There are already places in the code that compute ndistinct for columns without extended ndistinct statistics (see estimate_num_groups) - but it is worth thinking carefully about whether the added complexity is justified before going down that path.

-- 
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

--------------HUM0S3z2abOrBI0EgJKRqpuY--