Re: [PATCH] Fix overflow and underflow in regr_r2()

From: Dean Rasheed <[email protected]>
To: Chengpeng Yan <[email protected]>
To: [email protected]
Cc: Tom Lane <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: [PATCH] Fix overflow and underflow in regr_r2()
Date: Thu, 28 May 2026 13:37:54 +0100
Message-ID: <CAEZATCWwqMUHnjNzBday-fh15oPNqec3vLnqj7A8-G+vnmQ0RQ@mail.gmail.com>
In-Reply-To: <[email protected]>
References: <[email protected]>
	<CAEZATCUaV+qmBCh0zv0EPdBvhE2skHCzkRw78VNjBMCX9Z7h+w@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<CAEZATCXpUwijoE2imbwWWraG63SmzgXyE+B8eoFNennu87-=kw@mail.gmail.com>
	<[email protected]>
	<CAEZATCWDfB8QQeqrhD2_BvDrPOF5b=nskOWaRmzMGvyntv5Tnw@mail.gmail.com>
	<[email protected]>
	<CAEZATCUAu5JOJnyy5yXTGsfSRCDnbrGfHCWRxYVPtxROzKH30A@mail.gmail.com>
	<[email protected]>

On Sat, 23 May 2026 at 03:42, Chengpeng Yan <[email protected]> wrote:
>
> Thanks for the regr_intercept.patch. The approach looks good to me.

Thanks for reviewing, and sorry for the delay getting back to you.

> 2. `dy` seems a bit hard to understand. Perhaps `offset`, as used in the
> earlier sketch, would be clearer?

[Shrug] I think dy is common enough to denote a difference in
y-values, and it seems clear enough, given the large comment above it.

> 3. Do we need to add tests for the underflow path, and perhaps for the
> Inf/NaN guard?

Yeah, I think it makes sense to include a test with underflow, since
that really can lead to a large relative error. I don't think it's
worth testing the Inf/NaN guard, since that's more about avoiding
operating on technically uninitialised variables, and I don't believe
that it actually affects the results.

I've added this test case:

SELECT regr_intercept(y, x)
  FROM (VALUES (-1e-131, 0), (2e-131, 3e-131)) v(x, y);

Here, directly computing Sx * Sxy / Sxx underflows to zero, whereas
the correct value of that term is 1e-131. Since Sy is 3e-131, this
makes a noticeable difference to the final result: without the patch,
it returns an intercept of 1.5e-131, whereas with the patch, it
correctly returns 1e-131.
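
For anyone who wants to see the effect outside the server, here is a
minimal standalone sketch of the two ways of computing that term. The
function names dy_naive() and dy_scaled() are made up for illustration
and are not in the patch, and the inputs in the note below were worked
out by hand from the definitions of Sx, Sxy and Sxx rather than taken
from the aggregate's transition state. It assumes IEEE 754 doubles
evaluated in double precision (e.g. x86-64 with SSE).

```c
#include <math.h>

/*
 * Direct evaluation: the intermediate product Sx * Sxy can underflow to
 * zero or overflow to infinity even when the final quotient is an
 * ordinary double.
 */
double
dy_naive(double Sx, double Sxy, double Sxx)
{
	return Sx * Sxy / Sxx;
}

/*
 * Scaled evaluation: split each finite, nonzero input into a mantissa in
 * [0.5, 1) and an integer power of two with frexp(), do the multiply and
 * divide on the mantissas, add and subtract the exponents, and rebuild
 * the value with ldexp().  Callers are assumed to have excluded Inf and
 * NaN inputs, as the patch does.
 */
double
dy_scaled(double Sx, double Sxy, double Sxx)
{
	int		n_Sx, n_Sxy, n_Sxx;
	double	m_Sx = frexp(Sx, &n_Sx);
	double	m_Sxy = frexp(Sxy, &n_Sxy);
	double	m_Sxx = frexp(Sxx, &n_Sxx);

	return ldexp(m_Sx * m_Sxy / m_Sxx, n_Sx + n_Sxy - n_Sxx);
}
```

For the underflow test case the inputs are roughly Sx = 1e-131 and
Sxy = Sxx = 4.5e-262. dy_naive() returns 0, which gives
(3e-131 - 0) / 2 = 1.5e-131, while dy_scaled() returns approximately
1e-131, which gives (3e-131 - 1e-131) / 2 = 1e-131.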

If there are no objections from the RMT, I'll push both of these (to
HEAD only) in a couple of days or so.

Regards,
Dean


Attachments:

  [text/x-patch] v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patch (7.2K, 2-v2-0001-Improve-overflow-underflow-handling-in-regr_r2.patch)
From 6df152d526d28c38272d5ff17efcd2ba184f5649 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <[email protected]>
Date: Sat, 16 May 2026 09:07:22 +0100
Subject: [PATCH v2 1/2] Improve overflow/underflow handling in regr_r2().

Commit 6498287696d improved corr()'s final function to cope with
overflow/underflow in the final calculation, and clamped its result to
[-1, 1] in case of roundoff error. Improve regr_r2() in a similar way,
clamping its result to [0, 1].

Arguably this is a bug fix, but given the lack of prior complaints,
refrain from back-patching, as we did with 6498287696d.

Reported-by: Chengpeng Yan <[email protected]>
Author: Chengpeng Yan <[email protected]>
Reviewed-by: Dean Rasheed <[email protected]>
Reviewed-by: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/backend/utils/adt/float.c            | 37 ++++++++++++++-
 src/test/regress/expected/aggregates.out | 58 ++++++++++++++++--------
 src/test/regress/sql/aggregates.sql      | 19 ++++++--
 3 files changed, 87 insertions(+), 27 deletions(-)

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 362c29ab803..cc00c10c0d4 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3916,7 +3916,12 @@ float8_regr_r2(PG_FUNCTION_ARGS)
 	float8		N,
 				Sxx,
 				Syy,
-				Sxy;
+				Sxy,
+				numerator,
+				denominator,
+				sqrtdenominator,
+				sqrtresult,
+				result;
 
 	transvalues = check_float8_array(transarray, "float8_regr_r2", 8);
 	N = transvalues[0];
@@ -3938,7 +3943,35 @@ float8_regr_r2(PG_FUNCTION_ARGS)
 	if (Syy == 0)
 		PG_RETURN_FLOAT8(1.0);
 
-	PG_RETURN_FLOAT8((Sxy * Sxy) / (Sxx * Syy));
+	/*
+	 * The products Sxy * Sxy and/or Sxx * Syy might underflow or overflow. If
+	 * so, we can recover by computing Sxy / (sqrt(Sxx) * sqrt(Syy)) and
+	 * squaring it instead.  However, the double sqrt() calculation is a bit
+	 * slower and less accurate, so don't do it if we don't have to.
+	 */
+	numerator = Sxy * Sxy;
+	denominator = Sxx * Syy;
+	if (numerator == 0 || isinf(numerator) ||
+		denominator == 0 || isinf(denominator))
+	{
+		sqrtdenominator = sqrt(Sxx) * sqrt(Syy);
+		sqrtresult = Sxy / sqrtdenominator;
+		result = sqrtresult * sqrtresult;
+	}
+	else
+		result = numerator / denominator;
+
+	/*
+	 * Despite all these precautions, this formula can yield results outside
+	 * [0, 1] due to roundoff error.  Clamp it to the expected range.
+	 *
+	 * Note that result is guaranteed to be non-negative because Sxx and Syy
+	 * are non-negative, so we only need to clamp the upper end of the range.
+	 */
+	if (result > 1)
+		result = 1;
+
+	PG_RETURN_FLOAT8(result);
 }
 
 Datum
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index fbda0e3bbc2..1ccdf7dfdd7 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -516,6 +516,7 @@ SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
 (1 row)
 
 -- check some cases that formerly had poor roundoff-error behavior
+-- note: regr_r2() differs from corr() for a horizontal line, per spec
 SELECT corr(0.09, g), regr_r2(0.09, g)
   FROM generate_series(1, 30) g;
  corr | regr_r2 
@@ -537,38 +538,55 @@ SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
      
 (1 row)
 
-SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+-- check some cases that formerly suffered from internal overflow/underflow
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
+       regr_r2(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
   FROM generate_series(1, 3) g;
- corr 
-------
-    1
+ corr | regr_r2 
+------+---------
+    1 |       1
 (1 row)
 
-SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
+       regr_r2(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
   FROM generate_series(1, 30) g;
- corr 
-------
-    1
+ corr | regr_r2 
+------+---------
+    1 |       1
+(1 row)
+
+SELECT corr(1e100 + g * 1e95, 1e100 + g * 1e95),
+       regr_r2(1e100 + g * 1e95, 1e100 + g * 1e95)
+  FROM generate_series(1, 2) g;
+ corr | regr_r2 
+------+---------
+    1 |       1
 (1 row)
 
 -- these examples pose definitional questions for NaN inputs,
 -- which we resolve by saying that an all-NaN input column is not all equal
-SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
- corr 
-------
-  NaN
+SELECT corr(g, 'NaN'), regr_r2(g, 'NaN') FROM generate_series(1, 30) g;
+ corr | regr_r2 
+------+---------
+  NaN |     NaN
 (1 row)
 
-SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
- corr 
-------
-     
+SELECT corr(0.1, 'NaN'), regr_r2(0.1, 'NaN') FROM generate_series(1, 30) g;
+ corr | regr_r2 
+------+---------
+      |       1
 (1 row)
 
-SELECT corr('NaN', 'NaN') FROM generate_series(1, 30) g;
- corr 
-------
-  NaN
+SELECT corr('NaN', 0.1), regr_r2('NaN', 0.1) FROM generate_series(1, 30) g;
+ corr | regr_r2 
+------+---------
+      |        
+(1 row)
+
+SELECT corr('NaN', 'NaN'), regr_r2('NaN', 'NaN') FROM generate_series(1, 30) g;
+ corr | regr_r2 
+------+---------
+  NaN |     NaN
 (1 row)
 
 -- test accum and combine functions directly
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 580f364ba97..a310b39e7b8 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -141,22 +141,31 @@ SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
 SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
 
 -- check some cases that formerly had poor roundoff-error behavior
+-- note: regr_r2() differs from corr() for a horizontal line, per spec
 SELECT corr(0.09, g), regr_r2(0.09, g)
   FROM generate_series(1, 30) g;
 SELECT corr(g, 0.09), regr_r2(g, 0.09), regr_slope(g, 0.09), regr_intercept(g, 0.09)
   FROM generate_series(1, 30) g;
 SELECT corr(1.3 + g * 1e-16, 1.3 + g * 1e-16)
   FROM generate_series(1, 3) g;
-SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+
+-- check some cases that formerly suffered from internal overflow/underflow
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
+       regr_r2(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
   FROM generate_series(1, 3) g;
-SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
+SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
+       regr_r2(1e-100 + g * 1e-105, 1e-100 + g * 1e-105)
   FROM generate_series(1, 30) g;
+SELECT corr(1e100 + g * 1e95, 1e100 + g * 1e95),
+       regr_r2(1e100 + g * 1e95, 1e100 + g * 1e95)
+  FROM generate_series(1, 2) g;
 
 -- these examples pose definitional questions for NaN inputs,
 -- which we resolve by saying that an all-NaN input column is not all equal
-SELECT corr(g, 'NaN') FROM generate_series(1, 30) g;
-SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
-SELECT corr('NaN', 'NaN') FROM generate_series(1, 30) g;
+SELECT corr(g, 'NaN'), regr_r2(g, 'NaN') FROM generate_series(1, 30) g;
+SELECT corr(0.1, 'NaN'), regr_r2(0.1, 'NaN') FROM generate_series(1, 30) g;
+SELECT corr('NaN', 0.1), regr_r2('NaN', 0.1) FROM generate_series(1, 30) g;
+SELECT corr('NaN', 'NaN'), regr_r2('NaN', 'NaN') FROM generate_series(1, 30) g;
 
 -- test accum and combine functions directly
 CREATE TABLE regr_test (x float8, y float8);
-- 
2.51.0



  [text/x-patch] v2-0002-Improve-overflow-underflow-handling-in-regr_inter.patch (4.3K, 3-v2-0002-Improve-overflow-underflow-handling-in-regr_inter.patch)
From 01881f2ea25dfa2ea93465ece26cc3e8a0ae8593 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <[email protected]>
Date: Thu, 28 May 2026 12:51:09 +0100
Subject: [PATCH v2 2/2] Improve overflow/underflow handling in
 regr_intercept().

As with corr() and regr_r2(), improve regr_intercept()'s final
function to cope with overflow/underflow in the final calculation.
Here, instead of using sqrt(), we use frexp() and ldexp() to recover,
if an overflow or underflow is detected, so that the multiplication
and division steps operate on normalised mantissas, and cannot
overflow or underflow.

As with 6498287696d, and the previous commit improving regr_r2(), this
is arguably a bug fix, but given the lack of prior complaints, refrain
from back-patching.

Reported-by: Tom Lane <[email protected]>
Author: Dean Rasheed <[email protected]>
Reviewed-by: Chengpeng Yan <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/backend/utils/adt/float.c            | 39 ++++++++++++++++++++++--
 src/test/regress/expected/aggregates.out | 12 ++++++++
 src/test/regress/sql/aggregates.sql      |  2 ++
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index cc00c10c0d4..262ea2b73ba 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -4010,7 +4010,8 @@ float8_regr_intercept(PG_FUNCTION_ARGS)
 				Sx,
 				Sxx,
 				Sy,
-				Sxy;
+				Sxy,
+				dy;
 
 	transvalues = check_float8_array(transarray, "float8_regr_intercept", 8);
 	N = transvalues[0];
@@ -4029,7 +4030,41 @@ float8_regr_intercept(PG_FUNCTION_ARGS)
 	if (Sxx == 0)
 		PG_RETURN_NULL();
 
-	PG_RETURN_FLOAT8((Sy - Sx * Sxy / Sxx) / N);
+	/*
+	 * The intercept is given by (Sy - dy) / N, where dy = Sx * Sxy / Sxx.
+	 * However, when computing dy, the intermediate product Sx * Sxy might
+	 * underflow or overflow.  If so, we can recover by decomposing Sx, Sxy,
+	 * and Sxx into normalized mantissa and integer power-of-two components,
+	 * computing the corresponding components of dy, and then recomposing dy.
+	 * We avoid doing this if Sx, Sxy, or Sxx are infinite or NaN, since the
+	 * exponent returned by frexp() is unspecified in those cases (and the
+	 * final result would be the same in any case).
+	 */
+	dy = Sx * Sxy / Sxx;
+	if ((dy == 0 || isinf(dy)) &&
+		!(isinf(Sx) || isinf(Sxy) || isinf(Sxx) ||
+		  isnan(Sx) || isnan(Sxy) || isnan(Sxx)))
+	{
+		float8		m_Sx,
+					m_Sxy,
+					m_Sxx,
+					m_dy;
+		int			n_Sx,
+					n_Sxy,
+					n_Sxx,
+					n_dy;
+
+		m_Sx = frexp(Sx, &n_Sx);
+		m_Sxy = frexp(Sxy, &n_Sxy);
+		m_Sxx = frexp(Sxx, &n_Sxx);
+
+		m_dy = m_Sx * m_Sxy / m_Sxx;
+		n_dy = n_Sx + n_Sxy - n_Sxx;
+
+		dy = ldexp(m_dy, n_dy);
+	}
+
+	PG_RETURN_FLOAT8((Sy - dy) / N);
 }
 
 
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 1ccdf7dfdd7..89e051ee824 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -563,6 +563,18 @@ SELECT corr(1e100 + g * 1e95, 1e100 + g * 1e95),
     1 |       1
 (1 row)
 
+SELECT regr_intercept(y, x) FROM (VALUES (-1e150, 0), (2e150, 3e150)) v(x, y);
+ regr_intercept 
+----------------
+         1e+150
+(1 row)
+
+SELECT regr_intercept(y, x) FROM (VALUES (-1e-131, 0), (2e-131, 3e-131)) v(x, y);
+ regr_intercept 
+----------------
+         1e-131
+(1 row)
+
 -- these examples pose definitional questions for NaN inputs,
 -- which we resolve by saying that an all-NaN input column is not all equal
 SELECT corr(g, 'NaN'), regr_r2(g, 'NaN') FROM generate_series(1, 30) g;
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index a310b39e7b8..916383db927 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -159,6 +159,8 @@ SELECT corr(1e-100 + g * 1e-105, 1e-100 + g * 1e-105),
 SELECT corr(1e100 + g * 1e95, 1e100 + g * 1e95),
        regr_r2(1e100 + g * 1e95, 1e100 + g * 1e95)
   FROM generate_series(1, 2) g;
+SELECT regr_intercept(y, x) FROM (VALUES (-1e150, 0), (2e150, 3e150)) v(x, y);
+SELECT regr_intercept(y, x) FROM (VALUES (-1e-131, 0), (2e-131, 3e-131)) v(x, y);
 
 -- these examples pose definitional questions for NaN inputs,
 -- which we resolve by saying that an all-NaN input column is not all equal
-- 
2.51.0
