From: Dmytro Astapov
Date: Fri, 27 Mar 2026 12:29:12 +0000
Subject: array_agg(anyarray) silently produces corrupt results with parallel workers when inputs mix NULL and non-NULL array elements
To: pgsql-bugs@lists.postgresql.org

PostgreSQL version: 17.2 (also verified on 17.9 and 18.3)
Operating system: Linux x86_64 (Red Hat 8)

Short description
-----------------

array_agg(ARRAY[...]) produces silently corrupted 2-D arrays when the query uses parallel partial aggregation and the input arrays contain a mix of NULL and non-NULL elements. NULL values appear at the wrong positions in the output, and non-NULL values disappear or shift.

The corruption is non-deterministic: two identical queries against the same data can return different results.

Disabling parallelism with SET max_parallel_workers_per_gather = 0 eliminates the problem.

How to reproduce
----------------

The bug requires the planner to choose a Partial GroupAggregate (or Partial HashAggregate) plan with parallel workers. This typically needs a table large enough for the planner to prefer parallel aggregation (I used ~10M rows in my tests).

-- Create and populate a test table with ~10M rows.
-- Three NULL patterns in m1/m2/m3 columns:
--   ~68% rows: m1=val, m2=val, m3='' (no NULLs)
--   ~19% rows: m1=NULL, m2=NULL, m3=NULL (all NULL)
--   ~13% rows: m1=val, m2='', m3=val (no NULLs)

CREATE TABLE test_data (
    group_id    int NOT NULL,
    branch_name text NOT NULL,
    branch_type text NOT NULL,
    m1          text,
    m2          text,
    m3          text
);
INSERT INTO test_data
SELECT
    (i / 4),
    'BRANCH_' || i,
    'Type' || (i % 3),
    CASE WHEN i%100<68 THEN 'val_'||i WHEN i%100<87 THEN NULL ELSE 'val_'||i END,
    CASE WHEN i%100<68 THEN 'v2_'||i  WHEN i%100<87 THEN NULL ELSE '' END,
    CASE WHEN i%100<68 THEN ''        WHEN i%100<87 THEN NULL ELSE 'v3_'||i END
FROM generate_series(1, 2000000) AS g(i);

-- Replicate to ~10M rows so the planner picks a parallel plan.
INSERT INTO test_data
SELECT group_id + 500001, branch_name || '_r' || r, branch_type, m1, m2, m3
FROM test_data, generate_series(1, 4) r;

ANALYZE test_data;

-- Verify the plan uses Partial GroupAggregate with parallel workers:
SET max_parallel_workers_per_gather = 4;
EXPLAIN (COSTS OFF)
SELECT group_id % 100000 AS gid, array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Expected plan:
--   Finalize GroupAggregate
--     ->  Gather Merge
--           Workers Planned: 4
--           ->  Partial GroupAggregate
--                 ->  Sort
--                       ->  Parallel Seq Scan on test_data

-- Create ground truth (no parallelism):
SET max_parallel_workers_per_gather = 0;
CREATE TABLE gt AS
SELECT group_id % 100000 AS gid,
       array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Create parallel result:
SET max_parallel_workers_per_gather = 4;
CREATE TABLE pr AS
SELECT group_id % 100000 AS gid,
       array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Order-independent comparison (eliminates row-ordering differences,
-- detects only value corruption):
SELECT count(*) AS corrupted_groups
FROM (
    SELECT gid, array_agg(COALESCE(v, '!!NULL!!') ORDER BY v) AS sv
    FROM gt, unnest(m1m2m3s) v GROUP BY gid
) a
JOIN (
    SELECT gid, array_agg(COALESCE(v, '!!NULL!!') ORDER BY v) AS sv
    FROM pr, unnest(m1m2m3s) v GROUP BY gid
) b ON a.gid = b.gid
WHERE a.sv != b.sv;

On every run I have tested, corrupted_groups is in the range of 4000-6000 out of 100000 total groups (~5%). Results differ between runs, confirming non-determinism.

I verified this on three versions: REL_17_2, REL_17_9, REL_18_3. All three versions were built with: ./configure --enable-debug --enable-cassert
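
The order-independent comparison in the script above can be sketched in C as follows. This is only an illustration of the idea (replace NULLs with a sentinel, sort, compare), not code from PostgreSQL; the names cmp_str, sorted_copy and same_values are hypothetical. Unlike the SQL, which sorts on the raw value, the sketch sorts after substituting the sentinel, which yields an equally canonical order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NULL_SENTINEL "!!NULL!!"    /* same placeholder as in the SQL script */

/* qsort comparator for an array of C strings. */
static int
cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Copy n values, replacing NULL pointers with the sentinel, then sort. */
static const char **
sorted_copy(const char *const *vals, size_t n)
{
    const char **out = malloc((n ? n : 1) * sizeof *out);

    assert(out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = vals[i] ? vals[i] : NULL_SENTINEL;
    qsort(out, n, sizeof *out, cmp_str);
    return out;
}

/* True when a and b hold the same values, ignoring their order. */
static bool
same_values(const char *const *a, size_t na,
            const char *const *b, size_t nb)
{
    bool         equal = (na == nb);
    const char **sa = sorted_copy(a, na);
    const char **sb = sorted_copy(b, nb);

    for (size_t i = 0; equal && i < na; i++)
        equal = (strcmp(sa[i], sb[i]) == 0);
    free(sa);
    free(sb);
    return equal;
}
```

Two groups whose rows merely arrive in a different order compare equal, while a group in which a value turned into NULL (or the reverse) does not.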


Root cause
----------

The bug appears to be in array_agg_array_combine() in src/backend/utils/adt/array_userfuncs.c.
The combine function is used during parallel aggregation of array_agg(anyarray).
It was introduced in commit 16fd03e9565 ("Allow parallel aggregate on string_agg and array_agg", 2023-01-23), first shipped in PG 16.

When two partial aggregation states are combined, array_agg_array_combine must merge their null bitmaps. The current code only enters the bitmap-handling block when state2 (the incoming partial state) has a nullbitmap:

    if (state2->nullbitmap)
    {
        ...
    }

I think this misses the case where state1 (the running state) already has a nullbitmap but state2 does not. In that scenario, state2's data bytes are appended to state1's data buffer and state1->nitems is incremented, but the nullbitmap is NOT extended to cover state2's items. The bit positions for state2's items are left as uninitialized memory, which randomly marks some elements as NULL. This shifts the interpretation of the data buffer and corrupts the output.

For comparison, the non-parallel accumArrayResultArr() in arrayfuncs.c has this condition:

    if (astate->nullbitmap || ARR_HASNULL(arg))

which enters the bitmap block whenever EITHER the existing state has a bitmap OR the new input has NULLs.

This bug triggers when parallel workers split a group's rows such that one worker sees only NULL-containing arrays (building a state with a nullbitmap) and another sees only non-NULL arrays (no nullbitmap), and the combine function is called with the NULL-containing state as state1 and the non-NULL state as state2. Since row distribution across workers is non-deterministic, so is the corruption.
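
The triggering scenario can be modelled with a toy version of the combine step. This is a minimal sketch under stated assumptions, not the PostgreSQL code: ToyState and the toy_* functions are hypothetical, only the item count and the null bitmap are modelled, and the spare bitmap bits are zeroed to stand in for uninitialized memory (real garbage is arbitrary). As in PostgreSQL arrays, a 1 bit means non-NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for the aggregate state: only the fields the bug involves. */
typedef struct ToyState
{
    int      nitems;     /* elements accumulated so far */
    int      aitems;     /* bitmap capacity in bits (0 without a bitmap) */
    uint8_t *nullbitmap; /* bit 1 = non-NULL; NULL pointer = no NULLs at all */
} ToyState;

static void
set_bit(uint8_t *bm, int i, bool nonnull)
{
    if (nonnull)
        bm[i / 8] |= (uint8_t) (1u << (i % 8));
    else
        bm[i / 8] &= (uint8_t) ~(1u << (i % 8));
}

/* A missing bitmap means every element is non-NULL. */
static bool
get_bit(const uint8_t *bm, int i)
{
    return bm == NULL || ((bm[i / 8] >> (i % 8)) & 1) != 0;
}

/* Build a state of n elements; pass is_null = NULL for "no NULLs". */
static ToyState
toy_make(int n, const bool *is_null, int capacity_bits)
{
    ToyState s = {n, 0, NULL};

    if (is_null != NULL)
    {
        s.aitems = capacity_bits;
        /* Zeroed spare bits play the role of garbage that reads as NULL. */
        s.nullbitmap = calloc((size_t) (capacity_bits + 7) / 8, 1);
        assert(s.nullbitmap != NULL);
        for (int i = 0; i < n; i++)
            set_bit(s.nullbitmap, i, !is_null[i]);
    }
    return s;
}

/* Merge s2 into s1; "fixed" selects the patched guard condition. */
static void
toy_combine(ToyState *s1, const ToyState *s2, bool fixed)
{
    int  newnitems = s1->nitems + s2->nitems;
    bool enter = fixed ? (s1->nullbitmap != NULL || s2->nullbitmap != NULL)
                       : (s2->nullbitmap != NULL);

    if (enter)
    {
        if (s1->nullbitmap == NULL || newnitems > s1->aitems)
        {
            /* Create or grow the bitmap so it covers all merged elements. */
            uint8_t *nb = calloc((size_t) (newnitems + 7) / 8, 1);

            assert(nb != NULL);
            for (int i = 0; i < s1->nitems; i++)
                set_bit(nb, i, get_bit(s1->nullbitmap, i));
            free(s1->nullbitmap);
            s1->nullbitmap = nb;
            s1->aitems = newnitems;
        }
        /* Append s2's bits (all ones when s2 has no bitmap). */
        for (int i = 0; i < s2->nitems; i++)
            set_bit(s1->nullbitmap, s1->nitems + i,
                    get_bit(s2->nullbitmap, i));
    }
    s1->nitems = newnitems;
}

/* Count the elements that the merged bitmap reports as NULL. */
static int
toy_count_nulls(const ToyState *s)
{
    int nulls = 0;

    for (int i = 0; s->nullbitmap != NULL && i < s->nitems; i++)
        nulls += !get_bit(s->nullbitmap, i);
    return nulls;
}

/* state1: three NULLs (has a bitmap); state2: three non-NULLs (no bitmap). */
static int
toy_demo(bool fixed)
{
    const bool all_null[3] = {true, true, true};
    ToyState   s1 = toy_make(3, all_null, 16);
    ToyState   s2 = toy_make(3, NULL, 0);
    int        nulls;

    toy_combine(&s1, &s2, fixed);
    nulls = toy_count_nulls(&s1);
    free(s1.nullbitmap);
    return nulls;
}
```

With the original guard, toy_demo(false) reports six NULLs because state2's three bit positions are never written; with the patched guard, toy_demo(true) reports the expected three.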

Fixing this condition requires a second, related change in the same function. When extending the bitmap, the code computes:

    int newaitems = state1->aitems + state2->aitems;

With the corrected condition (state1->nullbitmap || state2->nullbitmap), state2->aitems can now be 0 (no bitmap was allocated for state2). This makes newaitems equal to state1->aitems, which may be less than newnitems (state1->nitems + state2->nitems).

The subsequent pg_nextpower2_32(newaitems) then allocates a bitmap that is too small, and array_bitmap_copy writes past the end of it.

This can be verified with an --enable-cassert build: running the reproduction steps then produces "problem in alloc set ExprContext: req size > alloc size" warnings.
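
The sizing problem can be checked with a little arithmetic. The sketch below is an illustration only: next_pow2_32 is a local stand-in for pg_nextpower2_32() (the smallest power of 2 that is >= its argument), and cap_original and cap_patched are hypothetical helper names for the two formulas discussed in this report. The figures used in the note after the block (256, 0, 500) are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local stand-in for pg_nextpower2_32(): smallest power of 2 >= num.
 * Assumes 0 < num <= 2^31, so the shift below cannot overflow.
 */
static uint32_t
next_pow2_32(uint32_t num)
{
    uint32_t p = 1;

    while (p < num)
        p <<= 1;
    return p;
}

/* Bitmap capacity (in bits) computed by the original formula. */
static uint32_t
cap_original(uint32_t aitems1, uint32_t aitems2)
{
    return next_pow2_32(aitems1 + aitems2);
}

/* Bitmap capacity (in bits) computed with the proposed Max(..., newnitems). */
static uint32_t
cap_patched(uint32_t aitems1, uint32_t aitems2, uint32_t newnitems)
{
    uint32_t sum = aitems1 + aitems2;

    return next_pow2_32(sum > newnitems ? sum : newnitems);
}
```

For example, with state1->aitems = 256, state2->aitems = 0 and newnitems = 500, cap_original(256, 0) yields 256 bits, which cannot hold 500 items, whereas cap_patched(256, 0, 500) yields 512.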

Fix
---

I think that the following two changes to array_agg_array_combine() in array_userfuncs.c fix the issue:

1. Change the condition guarding the null bitmap block from
   "if (state2->nullbitmap)" to
   "if (state1->nullbitmap || state2->nullbitmap)".

2. Change the bitmap reallocation size from
   "state1->aitems + state2->aitems" to
   "Max(state1->aitems + state2->aitems, newnitems)"
   to ensure the bitmap is always large enough.

Patch (applies cleanly to REL_16_STABLE through REL_18_STABLE and
master); the same patch is also attached:

--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -997,7 +997,7 @@
 			state1->data = (char *) repalloc(state1->data, state1->abytes);
 		}
 
-		if (state2->nullbitmap)
+		if (state1->nullbitmap || state2->nullbitmap)
 		{
 			int			newnitems = state1->nitems + state2->nitems;
 
@@ -1015,7 +1015,8 @@
 			}
 			else if (newnitems > state1->aitems)
 			{
-				int			newaitems = state1->aitems + state2->aitems;
+				int			newaitems = Max(state1->aitems + state2->aitems,
+										   newnitems);
 
 				state1->aitems = pg_nextpower2_32(newaitems);
 				state1->nullbitmap = (bits8 *)

I have verified that this patch eliminates the corruption on all three
versions tested (17.2, 17.9, 18.3): the corrupted_groups count drops
from ~5000 to 0 in every run.


Best regards, Dmytro
Attachment: fix_array_agg_parallel_nullbitmap.patch