Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vcgM3-009aib-0K for pgsql-hackers@arkaria.postgresql.org; Mon, 05 Jan 2026 08:54:23 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vcgM2-000mNg-06 for pgsql-hackers@arkaria.postgresql.org; Mon, 05 Jan 2026 08:54:22 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vcgM1-000mN7-2L for pgsql-hackers@lists.postgresql.org; Mon, 05 Jan 2026 08:54:22 +0000 Received: from mail-wr1-x430.google.com ([2a00:1450:4864:20::430]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vcgM0-004IGJ-2X for pgsql-hackers@lists.postgresql.org; Mon, 05 Jan 2026 08:54:21 +0000 Received: by mail-wr1-x430.google.com with SMTP id ffacd0b85a97d-42e2ba54a6fso5913009f8f.3 for ; Mon, 05 Jan 2026 00:54:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1767603259; x=1768208059; darn=lists.postgresql.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id:from :to:cc:subject:date:message-id:reply-to; bh=iJbV3+JtPSoRAn0Kh/5yjcNLSlgLQqSwghrQ04zIk4w=; b=bygY5MG/KwWrc7P/fNupI2hDmXznYt2rGUGeNDff6LXADIRfixWCeE7hU8Xto1xyxj 0o3LNFjxMO8mxf98iTY8MatnJKnjJjhnaXjRyFmvaKJC9NCDWuu2HoSPL9xeF1isDDQk z/xoEgVtU+HAmgB7iySKMpYU5CUmCwXoHj2QT2ZaOB32J1S09cJetWETvvGxM+GADw5K kVmHKKEiTg/XxWy2+nThWoE4IufFA6Gmx8nadv6hVS3Oj6l0zrQBy2xPAyB1teHRIdAb b4RSrxso77gu83JH2J0tiw94lDGIJva//HEt0ZOBuzuSdkiZJI6qDAmziSQb0bnye/b5 166w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1767603259; x=1768208059; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=iJbV3+JtPSoRAn0Kh/5yjcNLSlgLQqSwghrQ04zIk4w=; b=jM/8qvrTndffzqb/+6740Lp8ArD0G0lduA/S8eJQ7GE5nq/GHDTA0ZY7xD5MJYnIcK 0ITx5XDfdV0/I0oQUjThivUi/Goeiv0x5BfB4EvQHYiWxVuiIw21piYkAL0DrPLqab1B bUAi/QZztC0dLS47fYWhnFKgPo1QfJwm4HRIwXPPt2otCUfsD8PzG7eGZhWxJaDAY50j +TVpFtA3VaSZgTRsLFRnApXXgLRgCBwS9jr78MdvF9i1WxBdzS8oGxW930gW/yeGpFQY GzN7GnbM88o0MLR16+M8V3Zy99NFtsL1fkBt3AXEC2V1aAsDvsJhb9NWOb8Uu0An88uX 0P7w== X-Forwarded-Encrypted: i=1; AJvYcCVkMMz96iOgtbYDgNF6pYFPLFbs9bdK6vt3Sfzw2kdNF76r/ktqgzyrWf/Ll1/ujzyWk/mxCHcJLJTAzj0g@lists.postgresql.org X-Gm-Message-State: AOJu0Ywp7P94+y560MOHh3WeYeXMdnk5xJ4S/+vj2qz0kHonetosojz7 E5HQxDZJppADJrSm609UI72Qtaesywbj3LT828pUb/tHMWi6mcbN/5MJEAfbQA== X-Gm-Gg: AY/fxX4YKnqQptQqPI6EyAVjNnfYeSg2G4ZQban5/YOSRG+QN/BwY5pwYSiI+cqbLEG 6gU2R/xhwfPSxrxI4b6EwQMCHhyM9Unpf7r2J6Qhekm96BYEzei5Yh0bZjQP/QfpjRoQ0ApsV41 Cs6lp+n4vQyyBsIMFeoPzDpibLRwIM9qQ2PVPzcJar3B5xEceAKRMQQV046YZlZiOF422DshEOp dE6d4ULt4+u/eE/PpsLobGFtZazOcWRJPWTKolZSCj+ExwNr8FyXdnpHuaG/hZ8yfAU8r6CwB+7 FmpcOoRVe6VGcmz9pY/isLz32lPfX4bVVxlDUOM7IplJUPNNtA8g/fvgX581kcIAMuVJ8kAOaSc cLx2CF4ZxYLIvkKR0s0axZCsoHLBPc3A4/SfoAJWy1WMIgrg+wcgHVm0BTWbLNWsxYdezrmQT8o ilVqxTiT0NaCGM4Yk= X-Google-Smtp-Source: AGHT+IERgg3gi76P8ypI5VQvD7C+H5/kx1cBx2zp3h7mLTATVSxgTAFuoEP7M2ULpQvsXcEujPn50Q== X-Received: by 2002:a05:6000:144c:b0:42f:b9c6:c89d with SMTP id ffacd0b85a97d-4324e50a6d8mr60787692f8f.52.1767603258874; Mon, 05 Jan 2026 00:54:18 -0800 (PST) Received: from [192.168.2.32] ([147.161.234.249]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-43264613923sm84951455f8f.26.2026.01.05.00.54.17 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 05 Jan 2026 00:54:18 -0800 (PST) Message-ID: Date: Mon, 5 Jan 2026 09:54:17 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Hash-based MCV matching for large IN-lists To: Ilia Evdokimov , PostgreSQL Hackers References: <7db341e0-fbc6-4ec5-922c-11fdafe7be12@tantorlabs.com> Content-Language: en-US From: David Geier In-Reply-To: <7db341e0-fbc6-4ec5-922c-11fdafe7be12@tantorlabs.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi Ilia! On 29.12.2025 21:35, Ilia Evdokimov wrote: > Hi hackers, > > When estimating selectivity for ScalarArrayOpExpr (IN, ANY, ALL) and MCV > statistics are available for the column, the planner currently matches > IN-list elements against the MCV array using nested loop. For large IN- > list and large MCV arrays this results in O(N*M) behavior, which can > become unnecessarily expensive during planning. > > Thanks to David for pointing out this case [0] > Cool that you tackled this. I've seen this happening a lot in practice. > This patch introduces a hash-based matching path, analogous to what is > already done for MCV matching in join selectivity estimation (057012b > commit). Instead of linearly scanning the MCV array for each IN-list > element, we build a hash table and probe it to identify matches. > > The hash table is built over the MCV values, not over the IN-list. The > IN-list may contain NULLs, non-Const expressions, and duplicate values, > whereas the MCV list is guaranteed to contain distinct, non-NULL values > and represents the statistically meaningful domain we are matching > against. Hashing the MCVs therefore avoids duplicate work and directly > supports selectivity estimation. The downside of doing it this way is that we always pay the price of building a possibly big hash table if the column has a lot of MCVs, even for small IN lists. Why can't we build the hash table always on the smaller list, like we do already in the join selectivity estimation? For NULL we can add a flag to the hash entry, non-Const expressions must anyways be evaluated and duplicate values will be discarded during insert. > > For each IN-list element, if a matching MCV is found, we add the > corresponding MCV frequency to the selectivity estimate. If no match is > found, the remaining selectivity is estimated in the same way as the > existing non-MCV path (similar to var_eq_const when the constant is not > present in the MCV list). > The code in master currently calls an operator-specific selectivity estimation function. For equality this is typically eqsel() but the function can be specified during CREATE OPERATOR. Can be safely special-case the behavior of eqsel() for all possible operators for the ScalarArrayOpExpr case? > The hash-based path is enabled only when both a sufficiently large IN- > list and an MCV list are present, and suitable hash functions exist for > the equality operator. The threshold is currently the same as the one > used for join MCV hashing, since the underlying algorithmic tradeoffs > are similar. Seems reasonable. I'll test and review in more detail once we clarified the design. -- David Geier