Subject: BUG #19512: PG 17.10: SIGSEGV in build_minmax_path (planner) and hash_bytes (HashAgg executor)
To: pgsql-bugs@lists.postgresql.org
From: PG Bug reporting form
Cc: kspark@nepes.co.kr
Reply-To: kspark@nepes.co.kr, pgsql-bugs@lists.postgresql.org
Date: Fri, 05 Jun 2026 01:49:46 +0000
Message-ID: <19512-507749042aab33c7@postgresql.org>

The following bug has been logged on the website:

Bug reference:      19512
Logged by:          Kay
Email address:      kspark@nepes.co.kr
PostgreSQL version: 17.10
Operating system:   Ubuntu 24.04 LTS, kernel 6.8.0-110-generic, x86_64
Description:

PostgreSQL 17.10 (pgdg, Ubuntu 24.04) crashes with SIGSEGV in two
unrelated code paths when planning / executing aggregate queries against
a busy TimescaleDB hypertable. Both stacks are entirely inside core PG
(the TimescaleDB planner hook is on the stack but the crash address is
below it in PG core, not inside the extension's .so). Five cluster-wide
cascade restarts in 2 hours today, plus 5 distinct events over the
preceding 72 hours. We are not seeing these on a 17.9 cluster running a
similar workload.

ENVIRONMENT

- PostgreSQL 17.10 (pgdg, postgresql-17_17.10-1.pgdg24.04+1, binary mtime
  2026-05-12, installed 2026-05-19)
- Ubuntu 24.04 LTS, kernel 6.8.0-110-generic
- x86_64 AMD EPYC, 16 cores
- TimescaleDB 2.27.1 (2.27.1~ubuntu24.04-1710)
- Workload: ~5-6M rows/h continuous COPY into a 4.7B-row hypertable
  ("svid_trace"), 1h chunk interval, hypercore/columnstore compression
  with segmentby on a high-cardinality text column.
- Non-default GUCs relevant to crashes:
    timescaledb.enable_vectorized_aggregation = off  (workaround for #9902)
    timescaledb.auto_sparse_indexes = off
    timescaledb.enable_composite_bloom_indexes = off
    timescaledb.enable_sparse_index_bloom = off
    max_wal_size = 32GB
    data_checksums = off
  Everything else default.

CRASH 1 -- PLANNER SIDE, MIN/MAX PATH BUILD

Trigger query (issued by a small bash watchdog every 5 min; ran without
crashing for ~3 weeks and only started SIGSEGV'ing on 2026-06-05):

  SELECT COALESCE(EXTRACT(EPOCH FROM (now() - max(ts)))::bigint, 999999)
  FROM svid_trace;

svid_trace.ts is timestamptz. There is a btree on (ts) plus several
composite btrees containing ts as the second column.

PG log:

  2026-06-05 08:31:24.857 KST [3337534] LOG: server process (PID 4192477) was terminated by signal 11: Segmentation fault
  2026-06-05 08:31:24.857 KST [3337534] DETAIL: Failed process was running: SELECT COALESCE(EXTRACT(EPOCH FROM (now() - max(ts)))::bigint, 999999) FROM svid_trace

Stack trace (coredumpctl info 4192477):

  #0  set_base_rel_pathlists (postgres + 0x452c59)
  #1  query_planner (postgres + 0x474cbf)
  #2  build_minmax_path (postgres + 0x4750e7)
  #3  preprocess_minmax_aggregates (postgres + 0x4754c0)
  #4  grouping_planner (postgres + 0x476f4b)
  #5  subquery_planner (postgres + 0x47a653)
  #6  standard_planner (postgres + 0x47a9a8)
  #7  pgss_planner (pg_stat_statements.so + 0x6ced)
  #8  timescaledb_planner (timescaledb-2.27.1.so + 0x8071e)
  #9  planner (postgres + 0x560dcb)
  #10 pg_plan_queries (postgres + 0x560ec2)
  #11 exec_simple_query (postgres + 0x56248a)

Crash is inside set_base_rel_pathlists during MIN/MAX path generation.
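
For triage, it may help to exercise the same planning step without the
aggregate. The MIN/MAX optimization plans max(ts) as an indexed
ORDER BY ... LIMIT 1 subquery, so the sketch below writes that form by
hand. Treating it as equivalent for this hypertable is an assumption,
and whether it crashes the same way is untested; it is a sketch, not a
confirmed reproducer.

```sql
-- Hand-written equivalent of the subquery that the MIN/MAX optimization
-- plans for max(ts): the newest non-null ts, ideally via the btree on (ts).
-- Assumption: same table and column as the trigger query above.
SELECT ts
FROM svid_trace
WHERE ts IS NOT NULL
ORDER BY ts DESC
LIMIT 1;

-- List the indexes defined on the parent table, to see which of them
-- can supply the ts ordering (chunk-level indexes are not listed here).
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'svid_trace';
```

If the hand-written form also dies in set_base_rel_pathlists, that would
point at scan path generation for the table rather than at
preprocess_minmax_aggregates itself.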

CRASH 2 -- EXECUTOR SIDE, HASH AGGREGATION

Trigger query (issued every 15 min; fired four times today, crashed every
time, 16 minutes apart):

  SELECT time_bucket('15 minutes', ts) AS bucket,
         eqp_id, recipe_id, COUNT(*) AS n
  FROM svid_trace
  WHERE eqp_id = ANY(ARRAY['EQ-A','EQ-B', /* 13 ids total */])
    AND recipe_id IS NOT NULL AND recipe_id <> ''
    AND ts >  '2026-06-04 14:15:00+09'::timestamptz
    AND ts <= '2026-06-05 08:30:00+09'::timestamptz + interval '15 minutes'
  GROUP BY bucket, eqp_id, recipe_id
  ORDER BY bucket, eqp_id, n DESC;

svid_trace is partitioned on ts (1-hour chunk interval). All chunks in
the requested window are compressed (hypercore columnstore), except the
most recent one, which is uncompressed. The query plan when it works
(sort-based path, after Workaround 2 below) uses parallel Gather + Sort +
GroupAggregate. The crashing path uses HashAgg on a 3-key group
(time_bucket, text, text).

Stack trace (coredumpctl info 4183753):

  #0  hash_bytes (postgres + 0x72e480)
  #1  hash_any (postgres + 0x1b4a29)
  #2  FunctionCall1Coll (postgres + 0x373114)
  #3  lookup_hash_entries (postgres + 0x38b178)
  #4  agg_fill_hash_table (postgres + 0x38c071)
  #5  ExecProcNode (postgres + 0x3b1c36)
  #6  ExecProcNode (postgres + 0x375d73)
  #7  pgss_ExecutorRun (pg_stat_statements.so + 0x3f85)
  #8  ExecutorRun (postgres + 0x566f53)
  #9  PortalRun (postgres + 0x568ae8)
  #10 exec_simple_query (postgres + 0x5625e1)

Crash is inside hash_bytes while building/probing the HashAgg table.
FunctionCall1Coll on the stack, two frames above hash_bytes (it reaches
hash_bytes through hash_any), strongly suggests a Datum (likely the text
recipe_id) being passed to its hash function with a bad pointer / length.

WHAT WE RULED OUT

1. TimescaleDB: both crash PCs are inside the postgres text segment. The
   only TS frame on either stack (Crash 1) is timescaledb_planner, well
   above the crash. We previously suspected the ts_executor_end_hook
   issue we reported against 2.27.1; the new backtraces show it is not.
2. Disk corruption: data_checksums = off here, so we cannot prove a
   negative, but NVMe SMART shows zero media errors, zero data-integrity
   errors, 4% wear, 100% spare. The crashing queries succeed on rerun
   (e.g. Crash 2 returned 725 rows in 23 ms after Workaround 2), so chunk
   contents themselves read out cleanly.
3. Workload spike: pg_stat_activity snapshots from the crash minute show
   6 active backends; load average 1.8-2.5.

WORKAROUNDS DEPLOYED

1. For Crash 1: replaced the hypertable scan with a watermark table read
   in our application.
     SELECT max(ts) FROM svid_trace
       -> SELECT max(last_start_ts) FROM extraction_watermark
   As far as we can tell, the crashing path
   (preprocess_minmax_aggregates) is only triggered for the trivial
   MIN/MAX form on a single column; any expression appears to break the
   optimization and avoid the crash.
2. For Crash 2: SET LOCAL enable_hashagg = off in the same transaction
   before the GROUP BY. PG plans Sort + GroupAggregate instead and runs
   cleanly.

After both workarounds, zero crashes in the following hour.

ADJACENT 17.10 RELEASE-NOTE FIXES WE NOTICED

We checked the 17.10 release notes
(https://www.postgresql.org/docs/release/17.10/) before filing. Four
items sit in or right next to our crash regions:

- "Fix incomplete removal of relation references in RestrictInfo structs
  during join removal" (commit 53cb4ec1de7..). The commit message
  mentions a use-after-free that only manifests "if the freed space gets
  claimed by some List node before a Bitmapset can be put there." Same
  memory-race shape as our non-deterministic Crash 1 -- but our minimal
  trigger query has no JOIN at all.
- "Check for nondeterministic collations before assuming uniqueness from
  equality."
- "Fix incorrect logic for hashed IN / NOT IN with non-strict equality
  operator (... could crash or give wrong answers in hash aggregation)."
- "Fix 'no relation entry for relid 0' in set operations."

It is possible our Crash 2 is a still-unfixed sibling of the third item.
We have not been able to find a pgsql-bugs report for our exact stacks
(search on "set_base_rel_pathlists build_minmax_path" returns 0 hits for
the past year).

WHAT WE CAN PROVIDE ON REQUEST

- The .zst core files for both PIDs are kept locally for 7 more days. We
  cannot upload them publicly (they include hypertable buffer contents)
  but can run gdb on them and paste the output of any specific request.
- A second cluster running 17.9 + TimescaleDB 2.27.1 with a similar
  workload -- we can compare behaviour.
- Optional: we can downgrade one node to 17.9 and re-issue the trigger
  queries to confirm a 17.10 regression. Happy to do this if it would
  help triage.

We will follow up with a minimal SQL+data reproducer if we can produce
one outside of production traffic -- open to suggestions on how to
construct one, given that the planner-side crash appears to depend on
cumulative memory state.
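
For reference, Workaround 2 (Crash 2) can be sketched as the transaction
below. SET LOCAL lasts only until the end of the enclosing transaction,
so other sessions and later transactions keep the default planner
settings. The two ids are the placeholders from the trigger query; the
real list has 13 entries. This is a minimal sketch of the approach, not
the exact statement our application sends.

```sql
BEGIN;

-- Disable hash aggregation for this transaction only; the setting
-- reverts automatically at COMMIT or ROLLBACK.
SET LOCAL enable_hashagg = off;

SELECT time_bucket('15 minutes', ts) AS bucket,
       eqp_id, recipe_id, COUNT(*) AS n
FROM svid_trace
WHERE eqp_id = ANY(ARRAY['EQ-A','EQ-B'])  -- placeholder ids, real list has 13
  AND recipe_id IS NOT NULL AND recipe_id <> ''
  AND ts >  '2026-06-04 14:15:00+09'::timestamptz
  AND ts <= '2026-06-05 08:30:00+09'::timestamptz + interval '15 minutes'
GROUP BY bucket, eqp_id, recipe_id
ORDER BY bucket, eqp_id, n DESC;

COMMIT;
```

Prefixing the SELECT with EXPLAIN (COSTS OFF) shows, without executing
the query, whether the plan uses GroupAggregate rather than
HashAggregate.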