Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1pICBs-0002z9-O4 for pgsql-hackers@arkaria.postgresql.org; Wed, 18 Jan 2023 17:25:36 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.92) (envelope-from ) id 1pICBr-0007k6-3b for pgsql-hackers@arkaria.postgresql.org; Wed, 18 Jan 2023 17:25:35 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1pICBq-0007gL-PJ for pgsql-hackers@lists.postgresql.org; Wed, 18 Jan 2023 17:25:34 +0000 Received: from mail-wr1-x429.google.com ([2a00:1450:4864:20::429]) by makus.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.92) (envelope-from ) id 1pICBc-00016R-1a for pgsql-hackers@lists.postgresql.org; Wed, 18 Jan 2023 17:25:33 +0000 Received: by mail-wr1-x429.google.com with SMTP id h16so34668778wrz.12 for ; Wed, 18 Jan 2023 09:25:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=GuCBgJt7rZpMxX8rfluIzoZfMHH35NuwE0b/v5COKaI=; b=VZAXWhtriwTJj+Q6ywwGPxlmYz6KLCKB4EaWEfXJwMIDIirKBeP8dS2laTsUOr9JQh 0RfkxyT+n1erDSo/zTjr9xwFirBdTaaxF9EodfTq25QxG+OZ+ZFl4utNBiMW6ppvdIyP yLgg/lzi6HA3yW3tChmyoiEmb5cTUk5kFCeryJzS2+P41hAsmQdqBDzn9fUjwF0zyRRy f4FK2WOTiUWOe8OofFLCa8Jh6JXYgcWs3FQRBecehaQD/Unzv/PABuJglViEnydD15mY lgedFoayPC2iF9paJtGmJ2lQE1fe/DwyyUU4C1UphqfX47HI5JxEnu/tYzG64ovrD+7D iDqQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=GuCBgJt7rZpMxX8rfluIzoZfMHH35NuwE0b/v5COKaI=; b=4loA3GHnlU14q1DV3UbyboA7zWuJ0wXy3NIA8dq2mQecsjmdwCJ29ad02vssprxCIO WwHzNfu9ckQfyPoGWpbQhaJFk1uVqU3bwKp57Ev541mVh9YvVgYzgN9J6mPsx7z/5z9e c2OKcjD5N+/7k+4gE5rAxRfKVOvJ38eouzCyqL83n8iAkx6Fve78yjehDPjrFA4SqSQa WTI6d+tHDXHI+indxxG7wzZYw7OpcSbYfNQxHZuXYQ5m7etrq7WUo2xJTYTAJtBvfMFy jl0iGf6XlKBWQg6GE0EHssS2rOA93/gHx2OW+A5LpZBHxiwT2nrww787dgrcKksaDfY+ qNeA== X-Gm-Message-State: AFqh2ko7SkzpgtAyeTBIdiqrblGDulN9KxY6NlARtZeAwrzTgAdPPGmt V2Mz8zrT7bdYOahQVpeLr2KfADPoK0Hy+iJGMwhgEHB1fmr+htY97b6xFf7Nv5ywFXF34mAktZb PhYIMJgS5uFWsXhA0TNLUiDAX9yD41kryVl/SDj3cOFezqeoX5FWtInPMnOyGu0fkIjU7kZSD8u PgZsxSWZXNgtV+UX8mYXRgwbHZARxO1rCjoqcUGABfSdocoX1daxeyGeIoJseV32LrcIqquK0Bd 1hS++arlZHPgUCfnlMnVJKyC1eQyxxV0Dtxz+gim6oh06fnPRwsW4ao4xkI X-Google-Smtp-Source: AMrXdXvtmIJoDFnoB6/p6JSjxPMQCjUygpdeFIOUBp/JwVLv5AyqIzOxETJjRmSwhaiAeydKOoezEQ== X-Received: by 2002:a5d:4fc8:0:b0:256:ff7d:2347 with SMTP id h8-20020a5d4fc8000000b00256ff7d2347mr15180369wrw.13.1674062717097; Wed, 18 Jan 2023 09:25:17 -0800 (PST) Received: from [10.137.0.17] (static-84-42-175-93.bb.vodafone.cz. [84.42.175.93]) by smtp.gmail.com with ESMTPSA id bu9-20020a056000078900b002be1dcb6efbsm5711412wrb.9.2023.01.18.09.25.16 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 18 Jan 2023 09:25:16 -0800 (PST) Message-ID: <8afecd87-d1e5-241c-5e3e-75e1c62c279b@enterprisedb.com> Date: Wed, 18 Jan 2023 18:25:15 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0 Subject: Re: Implement missing join selectivity estimation for range types Content-Language: en-US To: Mahmoud Sakr , PostgreSQL Hackers Cc: SCHOEMANS Maxime , Diogo Repas , Luo Zhicheng , Andrey Lepikhov References: From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-CLOUD-SEC-AV-Info: enterprisedb,google_mail,monitor X-CLOUD-SEC-AV-Sent: true X-Gm-Spam: 0 X-Gm-Phishy: 0 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hello Mahmoud, Thanks for the patch and sorry for not taking a look earlier. On 6/30/22 16:31, Mahmoud Sakr wrote: > Hi, > Given a query: > SELECT * FROM t1, t2 WHERE t1.r << t2.r > where t1.r, t2.r are of range type, > currently PostgreSQL will estimate a constant selectivity for the << predicate, > which is equal to 0.005, not utilizing the statistics that the optimizer > collects for range attributes. > > We have worked out a theory for inequality join selectivity estimation > (http://arxiv.org/abs/2206.07396), and implemented it for range > types it in this patch. > Interesting. Are there any particular differences compared to how we estimate for example range clauses on regular columns? > The algorithm in this patch re-uses the currently collected statistics for > range types, which is the bounds histogram. It works fairly accurate for the > operations <<, >>, &&, &<, &>, <=, >= with estimation error of about 0.5%. Right. I think 0.5% is roughly expected for the default statistics target, which creates 100 histogram bins, each representing ~1% of the values. Which on average means ~0.5% error. > The patch also implements selectivity estimation for the > operations @>, <@ (contains and is contained in), but their accuracy is not > stable, since the bounds histograms assume independence between the range > bounds. A point to discuss is whether or not to keep these last two operations. That's a good question. I think the independence assumption is rather foolish in this case, so I wonder if we could "stabilize" this by making some different - less optimistic - assumption. Essentially, we have an estimates for lower/upper boundaries: P1 = P(lower(var1) <= lower(var2)) P2 = P(upper(var2) <= upper(var1)) and independence means we take (P1*P2). But maybe we should be very pessimistic and use e.g. Min(P1,P2)? Or maybe something in between? Another option is to use the length histogram, right? I mean, we know what the average length is, and it should be possible to use that to calculate how "far" ranges in a histogram can overlap. > The patch also includes the selectivity estimation for multirange types, > treating a multirange as a single range which is its bounding box. > OK. But ideally we'd cross-check elements of the two multiranges, no? > The same algorithm in this patch is applicable to inequality joins of scalar > types. We, however, don't implement it for scalars, since more work is needed > to make use of the other statistics available for scalars, such as the MCV. > This is left as a future work. > So if the column(s) contain a couple very common (multi)ranges that make it into an MCV, we'll ignore those? That's a bit unfortunate, because those MCV elements are potentially the main contributors to selectivity. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company