incorrect information in documentation

public inbox for [email protected]  
help / color / mirror / Atom feed

incorrect information in documentation
6+ messages / 3 participants
[nested] [flat]

* incorrect information in documentation
@ 2021-08-09 12:53 PG Doc comments form <[email protected]>
  2021-08-09 16:06 ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: PG Doc comments form @ 2021-08-09 12:53 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/13/row-estimation-examples.html
Description:

Hello, on page
https://www.postgresql.org/docs/current/row-estimation-examples.html - there
is a example:
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)
            = (1 - 0) * (1 - 0) / max(10000, 10000)
            = 0.0001
in the first string  " * min" and in the second " / max" 
as I understand it isn't correct.


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: incorrect information in documentation
  2021-08-09 12:53 incorrect information in documentation PG Doc comments form <[email protected]>
@ 2021-08-09 16:06 ` David G. Johnston <[email protected]>
  2021-08-09 16:20   ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: David G. Johnston @ 2021-08-09 16:06 UTC (permalink / raw)
  To: [email protected]; Pg Docs <[email protected]>

On Mon, Aug 9, 2021 at 8:02 AM PG Doc comments form <[email protected]>
wrote:

> Hello, on page
> https://www.postgresql.org/docs/current/row-estimation-examples.html -
> there
> is a example:
> selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
> 1/num_distinct2)
>             = (1 - 0) * (1 - 0) / max(10000, 10000)
>             = 0.0001
> in the first string  " * min" and in the second " / max"
> as I understand it isn't correct.
>

Division is just multiplication by the reciprocal so while the presentation
here is inconsistent it is correct.  Likewise, the larger a number the
smaller its reciprocal, so the change from min to max also works.

David J.

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: incorrect information in documentation
  2021-08-09 12:53 incorrect information in documentation PG Doc comments form <[email protected]>
  2021-08-09 16:06 ` Re: incorrect information in documentation David G. Johnston <[email protected]>
@ 2021-08-09 16:20   ` David G. Johnston <[email protected]>
  2021-08-09 18:05     ` Re: incorrect information in documentation Bruce Momjian <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: David G. Johnston @ 2021-08-09 16:20 UTC (permalink / raw)
  To: [email protected]; Pg Docs <[email protected]>; Tom Lane <[email protected]>

On Mon, Aug 9, 2021 at 9:06 AM David G. Johnston <[email protected]>
wrote:

> On Mon, Aug 9, 2021 at 8:02 AM PG Doc comments form <
> [email protected]> wrote:
>
>> Hello, on page
>> https://www.postgresql.org/docs/current/row-estimation-examples.html -
>> there
>> is a example:
>> selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
>> 1/num_distinct2)
>>             = (1 - 0) * (1 - 0) / max(10000, 10000)
>>             = 0.0001
>> in the first string  " * min" and in the second " / max"
>> as I understand it isn't correct.
>>
>
> Division is just multiplication by the reciprocal so while the
> presentation here is inconsistent it is correct.  Likewise, the larger a
> number the smaller its reciprocal, so the change from min to max also
> works.
>
>
FWIW this used to be presented with the calculation and formula in sync,
but the original had a simple typo in the calculation.  When the typo got
fixed back in December of 2007 [1] the author of the patch simplified the
calculation at the same time.  I suggest we update the formula line to
match the calculation presented.

David J.

1.
https://github.com/postgres/postgres/commit/f5678e8e07563e34ae4dc832546977d13edcd665

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: incorrect information in documentation
  2021-08-09 12:53 incorrect information in documentation PG Doc comments form <[email protected]>
  2021-08-09 16:06 ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  2021-08-09 16:20   ` Re: incorrect information in documentation David G. Johnston <[email protected]>
@ 2021-08-09 18:05     ` Bruce Momjian <[email protected]>
  2021-08-10 03:40       ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Bruce Momjian @ 2021-08-09 18:05 UTC (permalink / raw)
  To: David G. Johnston <[email protected]>; +Cc: [email protected]; Pg Docs <[email protected]>; Tom Lane <[email protected]>

On Mon, Aug  9, 2021 at 09:20:53AM -0700, David G. Johnston wrote:
> On Mon, Aug 9, 2021 at 9:06 AM David G. Johnston <[email protected]>
> wrote:
> 
>     On Mon, Aug 9, 2021 at 8:02 AM PG Doc comments form <[email protected]
>     > wrote:
> 
>         Hello, on page
>         https://www.postgresql.org/docs/current/row-estimation-examples.html -
>         there
>         is a example:
>         selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
>         num_distinct1,
>         1/num_distinct2)
>                     = (1 - 0) * (1 - 0) / max(10000, 10000)
>                     = 0.0001
>         in the first string  " * min" and in the second " / max"
>         as I understand it isn't correct.
> 
> 
>     Division is just multiplication by the reciprocal so while the presentation
>     here is inconsistent it is correct.  Likewise, the larger a number the
>     smaller its reciprocal, so the change from min to max also works.  
> 
> 
> 
> FWIW this used to be presented with the calculation and formula in sync, but
> the original had a simple typo in the calculation.  When the typo got fixed
> back in December of 2007 [1] the author of the patch simplified the calculation
> at the same time.  I suggest we update the formula line to match the
> calculation presented.

Nice, can you provide a patch please?

-- 
  Bruce Momjian  <[email protected]>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: incorrect information in documentation
  2021-08-09 12:53 incorrect information in documentation PG Doc comments form <[email protected]>
  2021-08-09 16:06 ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  2021-08-09 16:20   ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  2021-08-09 18:05     ` Re: incorrect information in documentation Bruce Momjian <[email protected]>
@ 2021-08-10 03:40       ` David G. Johnston <[email protected]>
  2022-04-12 21:36         ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: David G. Johnston @ 2021-08-10 03:40 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: [email protected]; Pg Docs <[email protected]>; Tom Lane <[email protected]>

On Mon, Aug 9, 2021 at 11:05 AM Bruce Momjian <[email protected]> wrote:

>
> >         selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
> >         num_distinct1,
> >         1/num_distinct2)
> >                     = (1 - 0) * (1 - 0) / max(10000, 10000)
> >                     = 0.0001
>
> Nice, can you provide a patch please?
>
>
Change the line:

selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)

to be:

selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_distinct1,
num_distinct2)

The wording already talks about "divide by max".

Though:

"so we use an algorithm that relies only on the number of distinct values
for both relations together with their null fractions:"

maybe adds a parenthetical note:

"so we use an algorithm that relies only on the number of distinct values
(the row count estimate for the whole table, not the -1 in the column
statistics) for both relations together with their null fractions:"

Just note I haven't tried to absorb that whole page, let alone the
implementation, and am not all that familiar with this part of PostgreSQL.
Its seems right, though, in isolation.

David J.


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: incorrect information in documentation
  2021-08-09 12:53 incorrect information in documentation PG Doc comments form <[email protected]>
  2021-08-09 16:06 ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  2021-08-09 16:20   ` Re: incorrect information in documentation David G. Johnston <[email protected]>
  2021-08-09 18:05     ` Re: incorrect information in documentation Bruce Momjian <[email protected]>
  2021-08-10 03:40       ` Re: incorrect information in documentation David G. Johnston <[email protected]>
@ 2022-04-12 21:36         ` David G. Johnston <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: David G. Johnston @ 2022-04-12 21:36 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: [email protected]; Pg Docs <[email protected]>; Tom Lane <[email protected]>

On Mon, Aug 9, 2021 at 8:40 PM David G. Johnston <[email protected]>
wrote:

> On Mon, Aug 9, 2021 at 11:05 AM Bruce Momjian <[email protected]> wrote:
>
>>
>> >         selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
>> >         num_distinct1,
>> >         1/num_distinct2)
>> >                     = (1 - 0) * (1 - 0) / max(10000, 10000)
>> >                     = 0.0001
>>
>> Nice, can you provide a patch please?
>>
>>
> Change the line:
>
>
Concretely, as attached and inline.

David J.


commit 73fa486a855d75d74a1a695bb350bfbfe27c7751
Author: David G. Johnston <[email protected]>
Date:   Tue Apr 12 21:23:53 2022 +0000

    doc: make unique non-null join selectivity example match the prose

    The description of the computation for the unique, non-null,
    join selectivity describes a division by the maximum of two values,
    while the example shows a multiplication by their reciprocal.  While
    equivalent the max phrasing is easier to understand; which seems
    more important here than precisely adhering to the formula use
    in the code (for which either variant is still an approximation).

    While both num_distinct and num_rows are equal for a unique column
    both the concept and formula use row count (10,000) and the
    field num_distinct has already been set to mean the specific value
    present in the pg_stats table (i.e, -1), so use num_rows here.

diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index 78053d7c49..f72bc4b274 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -391,18 +391,20 @@ tablename  | null_frac | n_distinct | most_common_vals
 </programlisting>

    In this case there is no <acronym>MCV</acronym> information for
-   <structfield>unique2</structfield> because all the values appear to be
-   unique, so we use an algorithm that relies only on the number of
-   distinct values for both relations together with their null fractions:
+   <structname>unique2</structname> and all the values appear to be
+   unique (n_distinct = -1), so we use an algorithm that relies on the row
+   count estimates for both relations (num_rows, not shown, but "tenk")
+   together with the column null fractions (zero for both):

 <programlisting>
-selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1,
1/num_distinct2)
+selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1,
num_rows2)
             = (1 - 0) * (1 - 0) / max(10000, 10000)
             = 0.0001
 </programlisting>

    This is, subtract the null fraction from one for each of the relations,
-   and divide by the maximum of the numbers of distinct values.
+   and divide by the row count of the larger relation (this value does get
+   scaled in the non-unique case).
    The number of rows
    that the join is likely to emit is calculated as the cardinality of the
    Cartesian product of the two inputs, multiplied by the


Attachments:

  [application/octet-stream] v0001-doc-make-row-estimation-example-match-prose.patch (2.4K, 3-v0001-doc-make-row-estimation-example-match-prose.patch)
  download | inline diff:
commit 73fa486a855d75d74a1a695bb350bfbfe27c7751
Author: David G. Johnston <[email protected]>
Date:   Tue Apr 12 21:23:53 2022 +0000

    doc: make unique non-null join selectivity example match the prose
    
    The description of the computation for the unique, non-null,
    join selectivity describes a division by the maximum of two values,
    while the example shows a multiplication by their reciprocal.  While
    equivalent the max phrasing is easier to understand; which seems
    more important here than precisely adhering to the formula use
    in the code (for which either variant is still an approximation).
    
    While both num_distinct and num_rows are equal for a unique column
    both the concept and formula use row count (10,000) and the
    field num_distinct has already been set to mean the specific value
    present in the pg_stats table (i.e, -1), so use num_rows here.

diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index 78053d7c49..f72bc4b274 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -391,18 +391,20 @@ tablename  | null_frac | n_distinct | most_common_vals
 </programlisting>
 
    In this case there is no <acronym>MCV</acronym> information for
-   <structfield>unique2</structfield> because all the values appear to be
-   unique, so we use an algorithm that relies only on the number of
-   distinct values for both relations together with their null fractions:
+   <structname>unique2</structname> and all the values appear to be
+   unique (n_distinct = -1), so we use an algorithm that relies on the row
+   count estimates for both relations (num_rows, not shown, but "tenk")
+   together with the column null fractions (zero for both):
 
 <programlisting>
-selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
+selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
             = (1 - 0) * (1 - 0) / max(10000, 10000)
             = 0.0001
 </programlisting>
 
    This is, subtract the null fraction from one for each of the relations,
-   and divide by the maximum of the numbers of distinct values.
+   and divide by the row count of the larger relation (this value does get
+   scaled in the non-unique case).
    The number of rows
    that the join is likely to emit is calculated as the cardinality of the
    Cartesian product of the two inputs, multiplied by the


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2022-04-12 21:36 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2021-08-09 12:53 incorrect information in documentation PG Doc comments form <[email protected]>
2021-08-09 16:06 ` David G. Johnston <[email protected]>
2021-08-09 16:20   ` David G. Johnston <[email protected]>
2021-08-09 18:05     ` Bruce Momjian <[email protected]>
2021-08-10 03:40       ` David G. Johnston <[email protected]>
2022-04-12 21:36         ` David G. Johnston <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox