MIME-Version: 1.0
From: Dmytro Astapov <dastapov@gmail.com>
Date: Fri, 27 Mar 2026 12:29:12 +0000
Message-ID: <CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com>
Subject: array_agg(anyarray) silently produces corrupt results with parallel
 workers when inputs mix NULL and non-NULL array elements
To: pgsql-bugs@lists.postgresql.org
Content-Type: multipart/mixed; boundary="000000000000f31918064e00a684"
Archived-At: <https://www.postgresql.org/message-id/CAFQUnFj2pQ1HbGp69%2Bw2fKqARSfGhAi9UOb%2BJjyExp7kx3gsqA%40mail.gmail.com>
Precedence: bulk

--000000000000f31918064e00a684
Content-Type: multipart/alternative; boundary="000000000000f31917064e00a682"

--000000000000f31917064e00a682
Content-Type: text/plain; charset="UTF-8"

PostgreSQL version: 17.2 (also verified on 17.9 and 18.3)
Operating system:   Linux x86_64 (Red Hat 8)

Short description
-----------------

array_agg(ARRAY[...]) produces silently corrupted 2-D arrays when the query
uses parallel partial aggregation and the input arrays contain a mix of
NULL and non-NULL elements. NULL values appear at the wrong positions in
the output, and non-NULL values disappear or shift.

The corruption is non-deterministic: two identical queries against the same
data can return different results.

Disabling parallelism with SET max_parallel_workers_per_gather = 0
eliminates the problem.

How to reproduce
----------------

The bug requires the planner to choose a Partial GroupAggregate (or Partial
HashAggregate) plan with parallel workers. This typically needs a table
large enough for the planner to prefer parallel aggregation (I used ~10M
rows in my tests).

-- Create and populate a test table with ~10M rows.
-- Three NULL patterns in m1/m2/m3 columns:
--   ~68% rows: m1=val, m2=val, m3='' (no NULLs)
--   ~19% rows: m1=NULL, m2=NULL, m3=NULL (all NULL)
--   ~13% rows: m1=val, m2='', m3=val (no NULLs)

CREATE TABLE test_data (
    group_id    int NOT NULL,
    branch_name text NOT NULL,
    branch_type text NOT NULL,
    m1          text,
    m2          text,
    m3          text
);

INSERT INTO test_data
SELECT
    (i / 4),
    'BRANCH_' || i,
    'Type' || (i % 3),
    CASE WHEN i%100<68 THEN 'val_'||i WHEN i%100<87 THEN NULL
         ELSE 'val_'||i END,
    CASE WHEN i%100<68 THEN 'v2_'||i  WHEN i%100<87 THEN NULL ELSE '' END,
    CASE WHEN i%100<68 THEN ''        WHEN i%100<87 THEN NULL
         ELSE 'v3_'||i END
FROM generate_series(1, 2000000) AS g(i);

-- Replicate to ~10M rows so the planner picks a parallel plan.
INSERT INTO test_data
SELECT group_id + 500001, branch_name || '_r' || r, branch_type, m1, m2, m3
FROM test_data, generate_series(1, 4) r;

ANALYZE test_data;

-- Verify the plan uses Partial GroupAggregate with parallel workers:
SET max_parallel_workers_per_gather = 4;
EXPLAIN (COSTS OFF)
SELECT group_id % 100000 AS gid, array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Expected plan:
--   Finalize GroupAggregate
--     ->  Gather Merge
--           Workers Planned: 4
--           ->  Partial GroupAggregate
--                 ->  Sort
--                       ->  Parallel Seq Scan on test_data

-- Create ground truth (no parallelism):
SET max_parallel_workers_per_gather = 0;
CREATE TABLE gt AS
SELECT group_id % 100000 AS gid,
       array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Create parallel result:
SET max_parallel_workers_per_gather = 4;
CREATE TABLE pr AS
SELECT group_id % 100000 AS gid,
       array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Order-independent comparison (eliminates row-ordering differences,
-- detects only value corruption):
SELECT count(*) AS corrupted_groups
FROM (
    SELECT gid, array_agg(COALESCE(v, '!!NULL!!') ORDER BY v) AS sv
    FROM gt, unnest(m1m2m3s) v GROUP BY gid
) a
JOIN (
    SELECT gid, array_agg(COALESCE(v, '!!NULL!!') ORDER BY v) AS sv
    FROM pr, unnest(m1m2m3s) v GROUP BY gid
) b ON a.gid = b.gid
WHERE a.sv != b.sv;

On every run I have tested, corrupted_groups is in the range of 4000-6000
out of 100000 total groups (~5%). Results differ between runs, confirming
non-determinism.

I verified this on three versions: REL_17_2, REL_17_9, REL_18_3. All three
versions were built with: ./configure --enable-debug --enable-cassert


Root cause
----------

The bug appears to be in array_agg_array_combine() in
src/backend/utils/adt/array_userfuncs.c.

This combine function is used during parallel aggregation of
array_agg(anyarray). It was introduced in commit 16fd03e9565 ("Allow
parallel aggregate on string_agg and array_agg", 2023-01-23), first shipped
in PG 16.

When two partial aggregation states are combined, array_agg_array_combine
must merge their null bitmaps. The current code only enters the
bitmap-handling block when state2 (the incoming partial state) has a
nullbitmap:

    if (state2->nullbitmap)
    {
        ...
    }

I think this misses the case where state1 (the running state) already has a
nullbitmap but state2 does not. In that scenario, state2's data bytes are
appended to state1's data buffer and state1->nitems is incremented, but the
nullbitmap is NOT extended to cover state2's items. The bit positions for
state2's items are left as uninitialized memory, which randomly marks some
elements as NULL. This shifts the interpretation of the data buffer and
corrupts the output.
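
To illustrate the failure mode outside PostgreSQL, here is a toy model in C.
It is only a sketch: ToyState, TOY_CAPACITY and the toy_* functions are
hypothetical names invented for this example, not PostgreSQL code, and the
zero-filled bitmap merely stands in for uninitialized memory so that the
misbehaviour is reproducible. As in PostgreSQL's null bitmaps, a set bit
means the element is not NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_CAPACITY 64     /* fixed capacity: no realloc logic in this sketch */

/* Hypothetical, simplified aggregate state (not ArrayBuildStateArr). */
typedef struct
{
    bool    has_bitmap;                 /* stands in for "nullbitmap != NULL" */
    uint8_t bitmap[TOY_CAPACITY / 8];   /* bit set = element is NOT NULL */
    int     nitems;                     /* elements accumulated so far */
} ToyState;

static void toy_init(ToyState *s)
{
    s->has_bitmap = false;
    s->nitems = 0;
    /* Zero bytes stand in for uninitialized memory; a 0 bit reads as NULL. */
    memset(s->bitmap, 0, sizeof(s->bitmap));
}

static void toy_set_bit(ToyState *s, int i, bool notnull)
{
    if (notnull)
        s->bitmap[i / 8] |= (uint8_t) (1 << (i % 8));
    else
        s->bitmap[i / 8] &= (uint8_t) ~(1 << (i % 8));
}

static bool toy_is_null(const ToyState *s, int i)
{
    if (!s->has_bitmap)
        return false;           /* no bitmap means no NULLs were seen */
    return (s->bitmap[i / 8] & (1 << (i % 8))) == 0;
}

/* Append one element; the bitmap is created lazily on the first NULL. */
static void toy_append(ToyState *s, bool isnull)
{
    assert(s->nitems < TOY_CAPACITY);
    if (isnull && !s->has_bitmap)
    {
        for (int i = 0; i < s->nitems; i++)
            toy_set_bit(s, i, true);    /* earlier elements were not NULL */
        s->has_bitmap = true;
    }
    if (s->has_bitmap)
        toy_set_bit(s, s->nitems, !isnull);
    s->nitems++;
}

/* Merge s2 into s1. fixed=false mimics the "if (state2->nullbitmap)" guard. */
static void toy_combine(ToyState *s1, const ToyState *s2, bool fixed)
{
    bool enter = fixed ? (s1->has_bitmap || s2->has_bitmap) : s2->has_bitmap;

    assert(s1->nitems + s2->nitems <= TOY_CAPACITY);
    if (enter)
    {
        if (!s1->has_bitmap)
        {
            for (int i = 0; i < s1->nitems; i++)
                toy_set_bit(s1, i, true);
            s1->has_bitmap = true;
        }
        for (int i = 0; i < s2->nitems; i++)
            toy_set_bit(s1, s1->nitems + i, !toy_is_null(s2, i));
    }
    s1->nitems += s2->nitems;   /* item count grows on both paths */
}

static int toy_count_nulls(const ToyState *s)
{
    int n = 0;

    for (int i = 0; i < s->nitems; i++)
        if (toy_is_null(s, i))
            n++;
    return n;
}
```

Combining a state built from three NULL elements (as state1) with a state
built from three non-NULL elements (as state2) makes toy_count_nulls()
report 6 with fixed = false, because state2's bit positions are never
written. With fixed = true it reports 3.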

For comparison, the non-parallel accumArrayResultArr() in arrayfuncs.c has
this condition:

    if (astate->nullbitmap || ARR_HASNULL(arg))

which enters the bitmap block whenever EITHER the existing state has a
bitmap OR the new input has NULLs.

This bug triggers when parallel workers split a group's rows such that one
worker sees only NULL-containing arrays (building a state with a
nullbitmap) and another sees only non-NULL arrays (no nullbitmap), and the
combine function is called with the NULL-containing state as state1 and the
non-NULL state as state2. Since row distribution across workers is
non-deterministic, the corruption is too.

Fixing this condition would require a second, related change in the same
function. When extending the bitmap, the code computes:

    int newaitems = state1->aitems + state2->aitems;

With the corrected condition (state1->nullbitmap || state2->nullbitmap),
state2->aitems can now be 0 (no bitmap was allocated for state2). This
makes newaitems equal to state1->aitems, which may be less than newnitems
(state1->nitems + state2->nitems).

The subsequent pg_nextpower2_32(newaitems) then allocates a bitmap that is
too small, and array_bitmap_copy writes past the end of it.
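
The sizing arithmetic can be checked in isolation with a small C sketch.
The helper names (toy_nextpower2_32, toy_max, capacity_original,
capacity_patched) are hypothetical stand-ins written for this example, and
the numbers used below are illustrative rather than taken from a real run.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for pg_nextpower2_32(): smallest power of 2 that is >= n. */
static uint32_t toy_nextpower2_32(uint32_t n)
{
    uint32_t p = 1;

    assert(n > 0 && n <= (UINT32_C(1) << 31));
    while (p < n)
        p <<= 1;
    return p;
}

static int toy_max(int a, int b)
{
    return (a > b) ? a : b;
}

/* Bitmap capacity (in items) chosen by the original sizing expression. */
static uint32_t capacity_original(int aitems1, int aitems2, int newnitems)
{
    (void) newnitems;           /* the original expression ignores it */
    return toy_nextpower2_32((uint32_t) (aitems1 + aitems2));
}

/* Bitmap capacity (in items) with the proposed Max(..., newnitems). */
static uint32_t capacity_patched(int aitems1, int aitems2, int newnitems)
{
    return toy_nextpower2_32((uint32_t) toy_max(aitems1 + aitems2, newnitems));
}
```

For example, with state1->aitems = 256, state2->aitems = 0 and
newnitems = 280, capacity_original() yields 256, which cannot hold 280
items, while capacity_patched() yields 512.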

This can be verified on a build configured with --enable-cassert: running
the reproduction steps produces "problem in alloc set ExprContext: req
size > alloc size" warnings.

Fix
---

I think that the following two changes to array_agg_array_combine() in
array_userfuncs.c fix the issue:

1. Change the condition guarding the null bitmap block from
   "if (state2->nullbitmap)" to
   "if (state1->nullbitmap || state2->nullbitmap)".

2. Change the bitmap reallocation size from
   "state1->aitems + state2->aitems" to
   "Max(state1->aitems + state2->aitems, newnitems)"
   to ensure the bitmap is always large enough.

Patch (applies cleanly to REL_16_STABLE through REL_18_STABLE and
master); the same patch is also included as an attachment:

--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -997,7 +997,7 @@
  state1->data = (char *) repalloc(state1->data, state1->abytes);
  }

- if (state2->nullbitmap)
+ if (state1->nullbitmap || state2->nullbitmap)
  {
  int newnitems = state1->nitems + state2->nitems;

@@ -1015,7 +1015,8 @@
  }
  else if (newnitems > state1->aitems)
  {
- int newaitems = state1->aitems + state2->aitems;
+ int newaitems = Max(state1->aitems + state2->aitems,
+   newnitems);

  state1->aitems = pg_nextpower2_32(newaitems);
  state1->nullbitmap = (bits8 *)

I have verified that this patch eliminates the corruption on all three
versions tested (17.2, 17.9, 18.3): the corrupted_groups count drops
from ~5000 to 0 in every run.


Best regards, Dmytro

--000000000000f31917064e00a682--
--000000000000f31918064e00a684
Content-Type: text/x-patch; charset="US-ASCII"; 
	name="fix_array_agg_parallel_nullbitmap.patch"
Content-Disposition: attachment; 
	filename="fix_array_agg_parallel_nullbitmap.patch"
Content-Transfer-Encoding: base64
Content-ID: <f_mn8vmq8d0>
X-Attachment-Id: f_mn8vmq8d0

LS0tIGEvc3JjL2JhY2tlbmQvdXRpbHMvYWR0L2FycmF5X3VzZXJmdW5jcy5jCisrKyBiL3NyYy9i
YWNrZW5kL3V0aWxzL2FkdC9hcnJheV91c2VyZnVuY3MuYwpAQCAtOTk3LDcgKzk5Nyw3IEBACiAJ
CQlzdGF0ZTEtPmRhdGEgPSAoY2hhciAqKSByZXBhbGxvYyhzdGF0ZTEtPmRhdGEsIHN0YXRlMS0+
YWJ5dGVzKTsKIAkJfQogCi0JCWlmIChzdGF0ZTItPm51bGxiaXRtYXApCisJCWlmIChzdGF0ZTEt
Pm51bGxiaXRtYXAgfHwgc3RhdGUyLT5udWxsYml0bWFwKQogCQl7CiAJCQlpbnQJCQluZXduaXRl
bXMgPSBzdGF0ZTEtPm5pdGVtcyArIHN0YXRlMi0+bml0ZW1zOwogCkBAIC0xMDE1LDcgKzEwMTUs
OCBAQAogCQkJfQogCQkJZWxzZSBpZiAobmV3bml0ZW1zID4gc3RhdGUxLT5haXRlbXMpCiAJCQl7
Ci0JCQkJaW50CQkJbmV3YWl0ZW1zID0gc3RhdGUxLT5haXRlbXMgKyBzdGF0ZTItPmFpdGVtczsK
KwkJCQlpbnQJCQluZXdhaXRlbXMgPSBNYXgoc3RhdGUxLT5haXRlbXMgKyBzdGF0ZTItPmFpdGVt
cywKKwkJCQkJCQkJCQkgICBuZXduaXRlbXMpOwogCiAJCQkJc3RhdGUxLT5haXRlbXMgPSBwZ19u
ZXh0cG93ZXIyXzMyKG5ld2FpdGVtcyk7CiAJCQkJc3RhdGUxLT5udWxsYml0bWFwID0gKGJpdHM4
ICopCg==
--000000000000f31918064e00a684--