Re: [PATCH] Fix overflow and underflow in regr

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: [PATCH] Fix overflow and underflow in regr_r2()
3+ messages / 2 participants
[nested] [flat]

* Re: [PATCH] Fix overflow and underflow in regr_r2()
@ 2026-05-23 01:14  Chengpeng Yan <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: Chengpeng Yan @ 2026-05-23 01:14 UTC (permalink / raw)
  To: Dean Rasheed <[email protected]>; +Cc: pgsql-hackers; Tom Lane <[email protected]>

Hi,

> On May 16, 2026, at 17:39, Dean Rasheed <[email protected]> wrote:
> 
>> corr() already has a stabilized calculation for the same Sxx * Syy
>> denominator scale. This patch factors that into a helper and lets
>> regr_r2() use it as a fallback when one of its direct products has
>> rounded to zero or infinity. Otherwise, regr_r2() keeps the existing
>> direct formula.
> 
> The comments need work -- in particular float8_regr_r2() needs a
> comment explaining the new overflow/underflow checks, similar to the
> comment in float8_corr(). In fact, doing that, I think it's preferable
> to just keep this change local to float8_regr_r2(), rather than
> refactoring into a helper function for just a few lines of code.
> 
> This new check in float8_regr_r2():
> 
> +   if (Sxy == 0 && !isnan(Sxx) && !isnan(Syy))
> +       PG_RETURN_FLOAT8(0.0);
> 
> seems pointless. It's optimising for a special case that will very
> rarely occur in practice, and which is handled fine by the general
> code. We don't want to slow down the common code path for such rare
> special cases.
> 
> I noticed that this new overflow test case:
> 
> +SELECT regr_r2(1e154::float8 * g, 1e154::float8 * g)
> +  FROM generate_series(1, 2) g;
> + regr_r2
> +---------
> +       1
> +(1 row)
> 
> only produces 1 because it's run with a reduced extra_float_digits
> value. I think it's better to use the test values "1e100 + g * 1e95",
> which still trigger the overflow on HEAD, but more reliably produce 1,
> regardless of the extra_float_digits setting, making it less likely
> that there will be variations between platforms. That's also more
> consistent with the other nearby test cases.

Thanks for the thorough review and feedback — I learned a lot from it!

> Attached is a v2 patch with those changes, plus a little more tidying
> up of the regression tests.

v2 LGTM. Thanks for the updates and test cleanup.

--
Best regards,
Chengpeng Yan


^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: [PATCH] Fix overflow and underflow in regr_r2()
@ 2026-05-23 02:42  Chengpeng Yan <[email protected]>
  1 sibling, 1 reply; 3+ messages in thread

From: Chengpeng Yan @ 2026-05-23 02:42 UTC (permalink / raw)
  To: Dean Rasheed <[email protected]>; +Cc: Tom Lane <[email protected]>; pgsql-hackers

Hi,

> On May 17, 2026, at 17:16, Dean Rasheed <[email protected]> wrote:
> 
> OK, here's a more complete patch along those lines, intended to apply
> on top of the regr_r2() patch.

Thanks for the regr_intercept.patch. The approach looks good to me.

I only noticed a few small things:

1. The patch file seems to have a format issue and doesn't apply
directly. `git apply` reports:

```
error: git apply: bad git-diff - expected /dev/null on line 2
```

2. `dy` seems a bit hard to understand. Perhaps `offset`, as used in the
earlier sketch, would be clearer?

3. Do we need to add tests for the underflow path, and perhaps for the
Inf/NaN guard?

Other than that, this looks good to me.

--
Best regards,
Chengpeng Yan

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: [PATCH] Fix overflow and underflow in regr_r2()
@ 2026-05-28 12:37  Dean Rasheed <[email protected]>
  parent: Chengpeng Yan <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Dean Rasheed @ 2026-05-28 12:37 UTC (permalink / raw)
  To: Chengpeng Yan <[email protected]>; [email protected]; +Cc: Tom Lane <[email protected]>; pgsql-hackers

On Sat, 23 May 2026 at 03:42, Chengpeng Yan <[email protected]> wrote:
>
> Thanks for the regr_intercept.patch. The approach looks good to me.

Thanks for reviewing, and sorry for the delay getting back to you.

> 2. `dy` seems a bit hard to understand. Perhaps `offset`, as used in the
> earlier sketch, would be clearer?

[Shrug] I think dy is common enough to denote a difference in
y-values, and it seems clear enough, given the large comment above it.

> 3. Do we need to add tests for the underflow path, and perhaps for the
> Inf/NaN guard?

Yeah, I think it makes sense to include a test with underflow, since
that really can lead to a large relative error. I don't think it's
worth testing the Inf/NaN guard, since that's more about avoiding
operating on technically uninitialised variables, and I don't believe
that it actually affects the results.

I've add this test case:

SELECT regr_intercept(y, x) FROM (VALUES (-1e-131, 0), (2e-131,
3e-131)) v(x, y);

Here, directly computing Sx * Sxy / Sxx causes an underflow to zero,
while the correct result should be 1e-131. Since Sy is 3e-131, this
makes a noticeable difference to the final result (without the patch,
it returns an intercept of 1.5e-131, whereas with the patch, it
correctly returns 1e-131).

If there are no objections from the RMT, I'll push both of these (to
HEAD only) in a couple of days or so.

Regards,
Dean


Attachments:

  [text/x-patch] v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patch (7.2K, 2-v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patch)
  download | inline diff:
From 6df152d526d28c38272d5ff17efcd2ba184f5649 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <[email protected]>
Date: Sat, 16 May 2026 09:07:22 +0100
Subject: [PATCH v2 1/2] Improve overflow/underflow handling in regr_r2().

Commit 6498287696d improved corr()'s final function to cope with
overflow/underflow in the final calculation, and clamped its result to
[-1, 1] in case of roundoff error. Improve regr_r2() in a similar way,
clamping its result to [0, 1].

Arguably this is a bug fix, but given the lack of prior complaints,
refrain from back-patching, as we did with 6498287696d.

Reported-by: Chengpeng Yan <[email protected]>
Author: Chengpeng Yan <[email protected]>
Reviewed-by: Dean Rasheed <[email protected]>
Reviewed-by: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/backend/utils/adt/float.c            | 37 ++++++++++++++-
 src/test/regress/expected/aggregates.out | 58 ++++++++++++++++--------
 src/test/regress/sql/aggregates.sql      | 19 ++++++--
 3 files changed, 87 insertions(+), 27 deletions(-)

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 362c29ab803..cc00c10c0d4 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3916,7 +3916,12 @@ float8_regr_r2(PG_FUNCTION_ARGS)
 	float8		N,
 				Sxx,
 				Syy,
-				Sxy;
+				Sxy,
+				numerator,
+				denominator,
+				sqrtdenominator,
+				sqrtresult,
+				result;
 
 	transvalues = check_float8_array(transarray, "float8_regr_r2", 8);
 	N = transvalues[0];
@@ -3938,7 +3943,35 @@ float8_regr_r2(PG_FUNCTION_ARGS)
 	if (Syy == 0)
 		PG_RETURN_FLOAT8(1.0);
 
-	PG_RETURN_FLOAT8((Sxy * Sxy) / (Sxx * Syy));
+	/*
+	 * The products Sxy * Sxy and/or Sxx * Syy might underflow or overflow. If
+	 * so, we can recover by computing Sxy / (sqrt(Sxx) * sqrt(Syy)) and
+	 * squaring it instead.  However, the double sqrt() calculation is a bit
+	 * slower and less accurate, so don't do it if we don't have to.
+	 */
+	numerator = Sxy * Sxy;
+	denominator = Sxx * Syy;
+	if (numerator == 0 || isinf(numerator) ||
+		denominator == 0 || isinf(denominator))
+	{
+		sqrtdenominator = sqrt(Sxx) * sqrt(Syy);
+		sqrtresult = Sxy / sqrtdenominator;
+		result = sqrtresult * sqrtresult;
+	}
+	else
+		result = numerator / denominator;
+
+	/*
+	 * Despite all these precautions, this formula can yield results outside
+	 * [0, 1] due to roundoff error.  Clamp it to the expected range.
+	 *
+	 * Note that result is guaranteed to be non-negative becase Sxx and Syy
+	 * are non-negative, so we only need to clamp the upper end of the range.
+	 */
+	if (result > 1)
+		result = 1;
+
+	PG_RETURN_FLOAT8(result);
 }
 
 Datum
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index fbda0e3bbc2..1ccdf7dfdd7 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -516,6 +516,7 @@ SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
 (1 row)
 
 -- check some cases that formerly had poor roundoff-error behavior
+-- note: regr_r2() differs from corr() for a horizontal line, per spec
 SELECT corr(0.09, g), regr_r2(0.09, g)
   FROM generate_series(1, 30) g;
  corr | regr_r2 
@@ -537,38 +538,55 @@ SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
      
 (1 row)
 
-SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+-- check some cases that formerly suffered from internal overflow/underflow
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
+       regr_r2(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
   FROM generate_series(1, 3) g;
- corr 
-------
-    1
+ corr | regr_r2 
+------+---------
+    1 |       1
 (1 row)
 
-SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
+       regr_r2(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
   FROM generate_series(1, 30) g;
- corr 
-------
-    1
+ corr | regr_r2 
+------+---------
+    1 |       1
+(1 row)
+
+SELECT corr(1e100 + g * 1e95, 1e100 + g * 1e95),
+       regr_r2(1e100 + g * 1e95, 1e100 + g * 1e95)
+  FROM generate_series(1, 2) g;
+ corr | regr_r2 
+------+---------
+    1 |       1
 (1 row)
 
 -- these examples pose definitional questions for NaN inputs,
 -- which we resolve by saying that an all-NaN input column is not all equal
-SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
- corr 
-------
-  NaN
+SELECT corr(g, 'NaN'), regr_r2(g, 'NaN') FROM generate_series(1, 30) g;
+ corr | regr_r2 
+------+---------
+  NaN |     NaN
 (1 row)
 
-SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
- corr 
-------
-     
+SELECT corr(0.1, 'NaN'), regr_r2(0.1, 'NaN') FROM generate_series(1, 30) g;
+ corr | regr_r2 
+------+---------
+      |       1
 (1 row)
 
-SELECT corr('NaN', 'NaN') FROM generate_series(1, 30) g;
- corr 
-------
-  NaN
+SELECT corr('NaN', 0.1), regr_r2('NaN', 0.1) FROM generate_series(1, 30) g;
+ corr | regr_r2 
+------+---------
+      |        
+(1 row)
+
+SELECT corr('NaN', 'NaN'), regr_r2('NaN', 'NaN') FROM generate_series(1, 30) g;
+ corr | regr_r2 
+------+---------
+  NaN |     NaN
 (1 row)
 
 -- test accum and combine functions directly
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 580f364ba97..a310b39e7b8 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -141,22 +141,31 @@ SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
 SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
 
 -- check some cases that formerly had poor roundoff-error behavior
+-- note: regr_r2() differs from corr() for a horizontal line, per spec
 SELECT corr(0.09, g), regr_r2(0.09, g)
   FROM generate_series(1, 30) g;
 SELECT corr(g, 0.09), regr_r2(g, 0.09), regr_slope(g, 0.09), regr_intercept(g, 0.09)
   FROM generate_series(1, 30) g;
 SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
   FROM generate_series(1, 3) g;
-SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+
+-- check some cases that formerly suffered from internal overflow/underflow
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
+       regr_r2(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
   FROM generate_series(1, 3) g;
-SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
+       regr_r2(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
   FROM generate_series(1, 30) g;
+SELECT corr(1e100 + g * 1e95, 1e100 + g * 1e95),
+       regr_r2(1e100 + g * 1e95, 1e100 + g * 1e95)
+  FROM generate_series(1, 2) g;
 
 -- these examples pose definitional questions for NaN inputs,
 -- which we resolve by saying that an all-NaN input column is not all equal
-SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
-SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
-SELECT corr('NaN', 'NaN') FROM generate_series(1, 30) g;
+SELECT corr(g, 'NaN'), regr_r2(g, 'NaN') FROM generate_series(1, 30) g;
+SELECT corr(0.1, 'NaN'), regr_r2(0.1, 'NaN') FROM generate_series(1, 30) g;
+SELECT corr('NaN', 0.1), regr_r2('NaN', 0.1) FROM generate_series(1, 30) g;
+SELECT corr('NaN', 'NaN'), regr_r2('NaN', 'NaN') FROM generate_series(1, 30) g;
 
 -- test accum and combine functions directly
 CREATE TABLE regr_test (x float8, y float8);
-- 
2.51.0



  [text/x-patch] v2-0002-Improve-overflow-underflow-handling-in-regr_inter.patch (4.3K, 3-v2-0002-Improve-overflow-underflow-handling-in-regr_inter.patch)
  download | inline diff:
From 01881f2ea25dfa2ea93465ece26cc3e8a0ae8593 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <[email protected]>
Date: Thu, 28 May 2026 12:51:09 +0100
Subject: [PATCH v2 2/2] Improve overflow/underflow handling in
 regr_intercept().

As with corr() and regr_r2(), improve regr_intercept()'s final
function to cope with overflow/underflow in the final calculation.
Here, instead of using sqrt(), we use frexp() and ldexp() to recover,
if an overflow or underflow is detected, so that the multiplication
and division steps operate on normalised mantissas, and cannot
overflow or underflow.

As with 6498287696d, and the previous commit improving regr_r2(), this
is arguably a bug fix, but given the lack of prior complaints, refrain
from back-patching.

Reported-by: Tom Lane <[email protected]>
Author: Dean Rasheed <[email protected]>
Reviewed-by: Chengpeng Yan <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/backend/utils/adt/float.c            | 39 ++++++++++++++++++++++--
 src/test/regress/expected/aggregates.out | 12 ++++++++
 src/test/regress/sql/aggregates.sql      |  2 ++
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index cc00c10c0d4..262ea2b73ba 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -4010,7 +4010,8 @@ float8_regr_intercept(PG_FUNCTION_ARGS)
 				Sx,
 				Sxx,
 				Sy,
-				Sxy;
+				Sxy,
+				dy;
 
 	transvalues = check_float8_array(transarray, "float8_regr_intercept", 8);
 	N = transvalues[0];
@@ -4029,7 +4030,41 @@ float8_regr_intercept(PG_FUNCTION_ARGS)
 	if (Sxx == 0)
 		PG_RETURN_NULL();
 
-	PG_RETURN_FLOAT8((Sy - Sx * Sxy / Sxx) / N);
+	/*
+	 * The intercept is given by (Sy - dy) / N, where dy = Sx * Sxy / Sxx.
+	 * However, when computing dy, the intermediate product Sx * Sxy might
+	 * underflow or overflow.  If so, we can recover by decomposing Sx, Sxy,
+	 * and Sxx into normalized mantissa and integer power-of-two components,
+	 * computing the corresponding components of dy, and then recomposing dy.
+	 * We avoid doing this if Sx, Sxy, or Sxx are infinite or NaN, since the
+	 * exponent returned by frexp() is unspecified in those cases (and the
+	 * final result would be the same in any case).
+	 */
+	dy = Sx * Sxy / Sxx;
+	if ((dy == 0 || isinf(dy)) &&
+		!(isinf(Sx) || isinf(Sxy) || isinf(Sxx) ||
+		  isnan(Sx) || isnan(Sxy) || isnan(Sxx)))
+	{
+		float8		m_Sx,
+					m_Sxy,
+					m_Sxx,
+					m_dy;
+		int			n_Sx,
+					n_Sxy,
+					n_Sxx,
+					n_dy;
+
+		m_Sx = frexp(Sx, &n_Sx);
+		m_Sxy = frexp(Sxy, &n_Sxy);
+		m_Sxx = frexp(Sxx, &n_Sxx);
+
+		m_dy = m_Sx * m_Sxy / m_Sxx;
+		n_dy = n_Sx + n_Sxy - n_Sxx;
+
+		dy = ldexp(m_dy, n_dy);
+	}
+
+	PG_RETURN_FLOAT8((Sy - dy) / N);
 }
 
 
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1ccdf7dfdd7..89e051ee824 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -563,6 +563,18 @@ SELECT corr(1e100 + g * 1e95, 1e100 + g * 1e95),
     1 |       1
 (1 row)
 
+SELECT regr_intercept(y, x) FROM (VALUES (-1e150, 0), (2e150, 3e150)) v(x, y);
+ regr_intercept 
+----------------
+         1e+150
+(1 row)
+
+SELECT regr_intercept(y, x) FROM (VALUES (-1e-131, 0), (2e-131, 3e-131)) v(x, y);
+ regr_intercept 
+----------------
+         1e-131
+(1 row)
+
 -- these examples pose definitional questions for NaN inputs,
 -- which we resolve by saying that an all-NaN input column is not all equal
 SELECT corr(g, 'NaN'), regr_r2(g, 'NaN') FROM generate_series(1, 30) g;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index a310b39e7b8..916383db927 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -159,6 +159,8 @@ SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
 SELECT corr(1e100 + g * 1e95, 1e100 + g * 1e95),
        regr_r2(1e100 + g * 1e95, 1e100 + g * 1e95)
   FROM generate_series(1, 2) g;
+SELECT regr_intercept(y, x) FROM (VALUES (-1e150, 0), (2e150, 3e150)) v(x, y);
+SELECT regr_intercept(y, x) FROM (VALUES (-1e-131, 0), (2e-131, 3e-131)) v(x, y);
 
 -- these examples pose definitional questions for NaN inputs,
 -- which we resolve by saying that an all-NaN input column is not all equal
-- 
2.51.0



^ permalink  raw  reply  [nested|flat] 3+ messages in thread

end of thread, other threads:[~2026-05-28 12:37 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-05-23 01:14 ` Chengpeng Yan <[email protected]>
2026-05-23 02:42 ` Chengpeng Yan <[email protected]>
2026-05-28 12:37   ` Dean Rasheed <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox