Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vhpom-00757w-2Z for pgsql-hackers@arkaria.postgresql.org; Mon, 19 Jan 2026 14:01:21 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vhpol-00DVoY-1k for pgsql-hackers@arkaria.postgresql.org; Mon, 19 Jan 2026 14:01:19 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vhpol-00DVoO-0L for pgsql-hackers@lists.postgresql.org; Mon, 19 Jan 2026 14:01:19 +0000 Received: from mail-pf1-x429.google.com ([2607:f8b0:4864:20::429]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vhpoi-001FaV-0A for pgsql-hackers@lists.postgresql.org; Mon, 19 Jan 2026 14:01:18 +0000 Received: by mail-pf1-x429.google.com with SMTP id d2e1a72fcca58-81df6a302b1so3562823b3a.2 for ; Mon, 19 Jan 2026 06:01:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1768831276; x=1769436076; darn=lists.postgresql.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id:from :to:cc:subject:date:message-id:reply-to; bh=vJ8s7Jom7FYqjmO38C2ADsyKcl0JAxSk2h96Z/QElk0=; b=AOH7tPC5d9LjTXTuWWuXDUJibZDQTpmJVGNTFFCLNwf04h4Ygd/lBRWvhybg9NkES+ 29Q2cbFhl4hIlOn9BE6PsACMSzvmn/VAFTj9XTC0ZnIdTbrdWusXZcC7HuqTSM0ni2dQ wgj8F9Zot+VuS/1PNa/MQgHKwgvw3Idso3S1lSdi5q0vd83wDRLEmSHnIbHwGqABheKG YyWoJr+yyN4O9heJi+EEC8Qp2raFzA9CSpCYQyMhX+Fk7Kliy7se6xhLBLoOgXrrW5eo MHflDYkQntGMnhlBmYFVr7xeIxRBRIMrkz/eiRDtXmYiq8Nx4VYG+eMMe5ivcU8wy6P+ B7Ig== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1768831276; x=1769436076; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=vJ8s7Jom7FYqjmO38C2ADsyKcl0JAxSk2h96Z/QElk0=; b=EvCi9KeUqd/upZvPbP4NxJiqktcrsLUqaxFJY905+deMHD1y2aMHVzoJDYXgdMHzpp B1lkcmQQTo33AtG+poaexPKn5AQnVVytZnYgrdkX67/4R/c6ep6Lhk4ZCZvhO3LwQuxd x7KAOoc+0F1+DCLB3pRET79yz15v8y8NjYDFJk55Rp20Pbspa3MTBPC9/nSweQ0gNGQI AJL34hUZA+woBfGP1Eone+NZb9buDmuvFDbEvKtXJJVo4xjSxeQ0LTL6MLQJ7mG+lQ2l 3k95L4sU2Ev0pzip3s5syQY2TcRXdp0ElZtEd1O+1c1jknRZ9S+bhhiQvIMrPbuvtELg v+TQ== X-Forwarded-Encrypted: i=1; AJvYcCVsDlHxcU4fSapS/hkIxuWe663Puq/UKxqFY1tGQgiEF5GMS6D819dXTi1O7p33EBRC2sgqHifM8dxT2cIr@lists.postgresql.org X-Gm-Message-State: AOJu0YxsXfBGXRdE7n/hzsMYI28u4RIVkLJIUSqVYJeb9xX5WQsjjVaZ bS6yJKSztz8h5sxZoTLbavYWuYt+Y4Obm8KH20nj5dpFrGVmNtw8ROvBFFof4Q== X-Gm-Gg: AY/fxX4Bub4BTfwl8P+WxFO3dslTGx8Jqe596DRgd5VDDdW7OgLPQKQKIRA1Zjj3mcW jzWGcV6jN58YHl9Vll5i+Qrel8FhnDT+m8p2vFcTWhehNZdBWBnvsvgPH7Yiu1nUN30nWWNCP3L rv9bKym/dd7pR2Psp5uKBWQ9wnHSXp8+cDuHtB9f+iqv7JkpI7DewP/j8ddcvYsXptrI6k7t3bT TzFHGCPCba7oGwrkmGEhdzo3Wfizu+oxhOp4qiyim6t5wIcfvNUUaFNwY00oC4N3zcS2oPeLVCa G6OKxBnPA4oSUck80bVrZ/27A1qr7yd/mbUYBmiI7ZxCCriY5IzvePV/iwXovSln7X1TDTBLIRw ovyvSwfk8du5h4TXG8XG4zuSPwsN4iKfJILOOqmx4JslLXlbaVigBn4YhB/fxQ451ZbYUB4aiAq bVhpbxyc0e3bB1WWhAQIBcGKlqweoajD6TObZ1K8n+z3GDG3oa7QchsHAk X-Received: by 2002:a05:6a00:1704:b0:81f:852b:a91c with SMTP id d2e1a72fcca58-81fa1887b26mr9071255b3a.64.1768831272853; Mon, 19 Jan 2026 06:01:12 -0800 (PST) Received: from [10.255.187.72] (vip-148-139-2-229.cust.service-now.com. [148.139.2.229]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-81fa12b49e0sm9225423b3a.67.2026.01.19.06.01.10 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 19 Jan 2026 06:01:12 -0800 (PST) Message-ID: <6c068ed3-0ab1-4743-862a-343134f2bfa4@gmail.com> Date: Mon, 19 Jan 2026 15:01:07 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Hash-based MCV matching for large IN-lists To: Ilia Evdokimov , PostgreSQL Hackers References: <7db341e0-fbc6-4ec5-922c-11fdafe7be12@tantorlabs.com> <988e3168-6096-488a-bb42-787e1e8c21a4@tantorlabs.com> Content-Language: en-US From: David Geier In-Reply-To: <988e3168-6096-488a-bb42-787e1e8c21a4@tantorlabs.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 14.01.2026 11:19, Ilia Evdokimov wrote: > After thinking more about this I realized that this is actually a better > match for how selectivity is currently modeled. After this comments in > master > >          * If we were being really tense we would try to confirm that the >          * elements are all distinct, but that would be expensive and it >          * doesn't seem to be worth the cycles; it would amount to > penalizing >          * well-written queries in favor of poorly-written ones. > However, we >          * do protect ourselves a little bit by checking whether the >          * disjointness assumption leads to an impossible (out of range) >          * probability; if so, we fall back to the normal calculation. > > when the hash table is built on the IN-list, duplicate IN-list values > are automatically eliminated during insertion, so we no longer risk > summing the same MCV frequency multiple times. This makes the disjoint- > probability estimate more robust and in practice slightly more accurate. Does that mean that we get a different estimation result, depending on if the IN list is smaller or not? I think we should avoid that because estimation quality might flip for the user unexpectedly. > One thing I initially missed that there are actually three different > places where ScalarArrayOpExpr is handled - the Const array case, the > ArrayExpr case and others - and Const and ArrayExpr require different > implementation of the same idea. In Const case we can directly hash and > probe Datum value, while ArrayExpr case we must work on Node* element, > separating constant and non-constant entries and only hashing the > constants. The current v2 therefore applies the same MCV-hash > optimization in both branches, but using two tailored code paths that > preserve the existing semantics of how non-Const elements are handled by > var_eq_non_const(). > > If the MCV list is smaller than the IN-list, the behavior is the same as > in v1 of the patch. If the IN-list is smaller, we instead build a hash > table over the distinct constant elements of the IN-list and then: > - Scan the MCV list and sum the frequencies of those MCVs that appear in > the IN-list; > - Count how many distinct IN-list not null constant elements are not > present in the MCV list; Is this to make sure we keep getting the same estimation result if the IN list is smaller and contains duplicates? > - Estimate the probability of each such non-MCV value using the > remaining frequency mass; > - Handle non-constant IN-list elements separately using > var_eq_non_const(), exactly as in the existing implementation. OK >>> >> The code in master currently calls an operator-specific selectivity >> estimation function. For equality this is typically eqsel() but the >> function can be specified during CREATE OPERATOR. >> >> Can be safely special-case the behavior of eqsel() for all possible >> operators for the ScalarArrayOpExpr case? > > > Unfortunately there is no safe way to make this optimization generic for > arbitrary restrict functions, because a custom RESTRICT function does > not have to use MCVs at all. IMO, in practice the vast majority of > ScalarArrayOpExpr uses with = or <> rely on the built-in equality > operators whose selectivity is computed by eqsel()/neqsel(), so I > limited this optimization to those cases. How did you do that? I cannot find the code that checks for that. > I’ve attached v2 of the patch. It currently uses two fairly large helper > functions for the Const and ArrayExpr cases; this is intentional to keep > the logic explicit and reviewable, even though these will likely need > refactoring or consolidation later. Beyond that, it seems like you can also combine/reuse a bunch of code for creating the hash map on the IN vs on the MCV list. For the MCVs, can't we reuse some code from the eqjoinsel() optimization we did? The entry and context structs look similar enough to only need one. Making the code more compact would ease reviewing a lot. -- David Geier