public inbox for [email protected]  
help / color / mirror / Atom feed
From: David G. Johnston <[email protected]>
To: Bruce Momjian <[email protected]>
Cc: [email protected]
Cc: Pg Docs <[email protected]>
Cc: Tom Lane <[email protected]>
Subject: Re: incorrect information in documentation
Date: Tue, 12 Apr 2022 14:36:47 -0700
Message-ID: <CAKFQuwax7V5R_rw=EOWmy=TBON6v3sveBx_WvwsENskCL5CLQQ@mail.gmail.com> (raw)
In-Reply-To: <CAKFQuwaV87e4sdFdZab4+zcUdza+EpveQi_98=f9wJ1+QLUALw@mail.gmail.com>
References: <[email protected]>
	<CAKFQuwbHsQ14v3i4oEwPOmgEBqEFjG4i=WsuOMmP2q82myvdQQ@mail.gmail.com>
	<CAKFQuwYQacXmG9zMcOgbBtGp-1QusGjXQBfMez9GcHpv=AX6SQ@mail.gmail.com>
	<[email protected]>
	<CAKFQuwaV87e4sdFdZab4+zcUdza+EpveQi_98=f9wJ1+QLUALw@mail.gmail.com>

On Mon, Aug 9, 2021 at 8:40 PM David G. Johnston <[email protected]>
wrote:

> On Mon, Aug 9, 2021 at 11:05 AM Bruce Momjian <[email protected]> wrote:
>
>>
>> >         selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
>> >         num_distinct1,
>> >         1/num_distinct2)
>> >                     = (1 - 0) * (1 - 0) / max(10000, 10000)
>> >                     = 0.0001
>>
>> Nice, can you provide a patch please?
>>
>>
> Change the line:
>
>
Concretely, as attached and inline.

David J.


commit 73fa486a855d75d74a1a695bb350bfbfe27c7751
Author: David G. Johnston <[email protected]>
Date:   Tue Apr 12 21:23:53 2022 +0000

    doc: make unique non-null join selectivity example match the prose

    The description of the computation for the unique, non-null,
    join selectivity describes a division by the maximum of two values,
    while the example shows a multiplication by their reciprocal.  While
    equivalent the max phrasing is easier to understand; which seems
    more important here than precisely adhering to the formula use
    in the code (for which either variant is still an approximation).

    While both num_distinct and num_rows are equal for a unique column
    both the concept and formula use row count (10,000) and the
    field num_distinct has already been set to mean the specific value
    present in the pg_stats table (i.e, -1), so use num_rows here.

diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index 78053d7c49..f72bc4b274 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -391,18 +391,20 @@ tablename  | null_frac | n_distinct | most_common_vals
 </programlisting>

    In this case there is no <acronym>MCV</acronym> information for
-   <structfield>unique2</structfield> because all the values appear to be
-   unique, so we use an algorithm that relies only on the number of
-   distinct values for both relations together with their null fractions:
+   <structname>unique2</structname> and all the values appear to be
+   unique (n_distinct = -1), so we use an algorithm that relies on the row
+   count estimates for both relations (num_rows, not shown, but "tenk")
+   together with the column null fractions (zero for both):

 <programlisting>
-selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)
+selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1,
num_rows2)
             = (1 - 0) * (1 - 0) / max(10000, 10000)
             = 0.0001
 </programlisting>

    This is, subtract the null fraction from one for each of the relations,
-   and divide by the maximum of the numbers of distinct values.
+   and divide by the row count of the larger relation (this value does get
+   scaled in the non-unique case).
    The number of rows
    that the join is likely to emit is calculated as the cardinality of the
    Cartesian product of the two inputs, multiplied by the


Attachments:

  [application/octet-stream] v0001-doc-make-row-estimation-example-match-prose.patch (2.4K, 3-v0001-doc-make-row-estimation-example-match-prose.patch)
  download | inline diff:
commit 73fa486a855d75d74a1a695bb350bfbfe27c7751
Author: David G. Johnston <[email protected]>
Date:   Tue Apr 12 21:23:53 2022 +0000

    doc: make unique non-null join selectivity example match the prose
    
    The description of the computation for the unique, non-null,
    join selectivity describes a division by the maximum of two values,
    while the example shows a multiplication by their reciprocal.  While
    equivalent the max phrasing is easier to understand; which seems
    more important here than precisely adhering to the formula use
    in the code (for which either variant is still an approximation).
    
    While both num_distinct and num_rows are equal for a unique column
    both the concept and formula use row count (10,000) and the
    field num_distinct has already been set to mean the specific value
    present in the pg_stats table (i.e, -1), so use num_rows here.

diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index 78053d7c49..f72bc4b274 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -391,18 +391,20 @@ tablename  | null_frac | n_distinct | most_common_vals
 </programlisting>
 
    In this case there is no <acronym>MCV</acronym> information for
-   <structfield>unique2</structfield> because all the values appear to be
-   unique, so we use an algorithm that relies only on the number of
-   distinct values for both relations together with their null fractions:
+   <structname>unique2</structname> and all the values appear to be
+   unique (n_distinct = -1), so we use an algorithm that relies on the row
+   count estimates for both relations (num_rows, not shown, but "tenk")
+   together with the column null fractions (zero for both):
 
 <programlisting>
-selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
+selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
             = (1 - 0) * (1 - 0) / max(10000, 10000)
             = 0.0001
 </programlisting>
 
    This is, subtract the null fraction from one for each of the relations,
-   and divide by the maximum of the numbers of distinct values.
+   and divide by the row count of the larger relation (this value does get
+   scaled in the non-unique case).
    The number of rows
    that the join is likely to emit is calculated as the cardinality of the
    Cartesian product of the two inputs, multiplied by the


reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: incorrect information in documentation
  In-Reply-To: <CAKFQuwax7V5R_rw=EOWmy=TBON6v3sveBx_WvwsENskCL5CLQQ@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox