Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ueElN-006k6o-Ll for pgsql-general@arkaria.postgresql.org; Tue, 22 Jul 2025 15:18:42 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ueElM-009GcA-IB for pgsql-general@arkaria.postgresql.org; Tue, 22 Jul 2025 15:18:40 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ueElM-009Gc2-54 for pgsql-general@lists.postgresql.org; Tue, 22 Jul 2025 15:18:40 +0000 Received: from mail.hjp.at ([212.17.106.138] helo=rorschach.hjp.at) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1ueElI-000FPl-2A for pgsql-general@postgresql.org; Tue, 22 Jul 2025 15:18:40 +0000 Received: by rorschach.hjp.at (Postfix, from userid 1000) id 056871EBE0; Tue, 22 Jul 2025 17:17:36 +0200 (CEST) Date: Tue, 22 Jul 2025 17:17:35 +0200 From: "Peter J. Holzer" To: pgsql-general@postgresql.org Subject: Wrapping a select in another select makes it slower Message-ID: <20250722151735.toridqqnb3krhimp@hjp.at> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="kt7fa7vrwjzf7qnh" Content-Disposition: inline List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --kt7fa7vrwjzf7qnh Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable PostgreSQL version 17.5 on Ubuntu 24.04 (PGDG build). Context: We use CheckMK to monitor our systems. This also reports various health indicators for Postgres. Recently a colleague noticed that the indicators for one database were missing, presumable since upgrading to Ubuntu 24.04 and Postgres 17. Closer investigation showed that the plugin was still running but took a long time (almost 4 minutes) to return a result (which caused a timeout). So I checked which of the many queries was slow and found the culprit. I'm not including the whole query here (it's a mess of of many nested subqueries) only the interesting part. This query returns a result quickly (less than 0.5 seconds): SELECT ns.nspname, tbl.relname, hdr, ma, bs, SUM((1-coalesce(null_frac,0))*coalesce(avg_width, 2048)) AS datawidth, MAX(coalesce(null_frac,0)) AS maxfracsum, hdr+( SELECT 1+count(*)/8 FROM pg_stats s2 WHERE null_frac<>0 AND s2.schemaname =3D ns.nspname AND s2.tablenam= e =3D tbl.relname ) AS nullhdr FROM pg_attribute att JOIN pg_class tbl ON att.attrelid =3D tbl.oid JOIN pg_namespace ns ON ns.oid =3D tbl.relnamespace LEFT JOIN pg_stats s ON s.schemaname=3Dns.nspname AND s.tablename =3D t= bl.relname AND s.inherited=3Dfalse AND s.attname=3Datt.attname, ( SELECT ( SELECT current_setting('block_size')::numeric) AS bs, CASE WHEN SUBSTRING(SPLIT_PART(v, ' ', 2) FROM '#\[0-9]+.[0-9]+= #\%' for '#') IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr, CASE WHEN v ~ 'mingw32' OR v ~ '64-bit' THEN 8 ELSE 4 END AS ma FROM ( SELECT version() AS v ) AS foo ) AS constants WHERE att.attnum > 0 AND tbl.relkind=3D'r' GROUP BY 1,2,3,4,5 ; This is wrapped in a very simple select: SELECT ma,bs,foo.nspname,foo.relname, (datawidth+(hdr+ma-(case when hdr%ma=3D0 THEN ma ELSE hdr%ma END)))::nu= meric AS datahdr, (maxfracsum*(nullhdr+ma-(case when nullhdr%ma=3D0 THEN ma ELSE nullhdr%= ma END))) AS nullhdr2 FROM ( -- the query above ) AS foo; And now it takes 4 minutes. Rewriting it as a normal CTE doesn't change that: with foo as ( -- the query above ) SELECT ... FROM foo; But a materialized CTE with foo as materialized ( -- the query above ) SELECT ... FROM foo; is fast again. Here are the execution plans of the first two queries (from explain (analyz= e)): GroupAggregate (cost=3D7048.46..142133.79 rows=3D2500 width=3D188) (actual= time=3D405.135..550.535 rows=3D625 loops=3D1) Group Key: tbl.relname, ns.nspname InitPlan 1 -> Result (cost=3D0.00..0.02 rows=3D1 width=3D32) (actual time=3D0.01= 7..0.017 rows=3D1 loops=3D1) -> Merge Left Join (cost=3D7048.44..7133.41 rows=3D8357 width=3D136) (a= ctual time=3D404.807..434.347 rows=3D40460 loops=3D1) Merge Cond: ((tbl.relname =3D s.tablename) AND (ns.nspname =3D s.sc= hemaname) AND (att.attname =3D s.attname)) -> Sort (cost=3D2571.61..2592.50 rows=3D8357 width=3D192) (actual= time=3D111.335..120.020 rows=3D40460 loops=3D1) Sort Key: tbl.relname, ns.nspname, att.attname Sort Method: external merge Disk: 8008kB -> Hash Join (cost=3D137.38..2027.20 rows=3D8357 width=3D19= 2) (actual time=3D57.024..75.192 rows=3D40460 loops=3D1) Hash Cond: (tbl.relnamespace =3D ns.oid) -> Hash Join (cost=3D136.29..1980.93 rows=3D8357 widt= h=3D132) (actual time=3D0.617..13.882 rows=3D40460 loops=3D1) Hash Cond: (att.attrelid =3D tbl.oid) -> Seq Scan on pg_attribute att (cost=3D0.00..1= 719.62 rows=3D47572 width=3D68) (actual time=3D0.026..6.804 rows=3D47512 lo= ops=3D1) Filter: (attnum > 0) Rows Removed by Filter: 10218 -> Hash (cost=3D128.47..128.47 rows=3D625 width= =3D72) (actual time=3D0.564..0.565 rows=3D625 loops=3D1) Buckets: 1024 Batches: 1 Memory Usage: 72= kB -> Seq Scan on pg_class tbl (cost=3D0.00.= =2E128.47 rows=3D625 width=3D72) (actual time=3D0.016..0.485 rows=3D625 loo= ps=3D1) Filter: (relkind =3D 'r'::"char") Rows Removed by Filter: 2933 -> Hash (cost=3D1.04..1.04 rows=3D4 width=3D68) (actu= al time=3D56.360..56.360 rows=3D4 loops=3D1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on pg_namespace ns (cost=3D0.00..1.= 04 rows=3D4 width=3D68) (actual time=3D56.319..56.325 rows=3D4 loops=3D1) -> Sort (cost=3D4476.84..4477.18 rows=3D139 width=3D200) (actual = time=3D293.367..300.227 rows=3D30921 loops=3D1) Sort Key: s.tablename, s.schemaname, s.attname Sort Method: external sort Disk: 6480kB -> Subquery Scan on s (cost=3D2189.59..4471.89 rows=3D139 w= idth=3D200) (actual time=3D207.797..253.934 rows=3D30921 loops=3D1) -> Hash Left Join (cost=3D2189.59..4470.50 rows=3D139= width=3D469) (actual time=3D207.791..251.676 rows=3D30921 loops=3D1) Hash Cond: (c.relnamespace =3D n.oid) -> Hash Join (cost=3D2188.50..4468.65 rows=3D13= 9 width=3D140) (actual time=3D207.723..246.208 rows=3D30921 loops=3D1) Hash Cond: ((s_1.starelid =3D c.oid) AND (s= _1.staattnum =3D a.attnum)) -> Seq Scan on pg_statistic s_1 (cost=3D0= =2E00..2008.21 rows=3D15460 width=3D14) (actual time=3D0.024..6.845 rows=3D= 30921 loops=3D1) Filter: (NOT stainherit) -> Hash (cost=3D1899.86..1899.86 rows=3D1= 9243 width=3D142) (actual time=3D207.320..207.323 rows=3D57730 loops=3D1) Buckets: 65536 (originally 32768) Ba= tches: 2 (originally 1) Memory Usage: 7681kB -> Hash Join (cost=3D172.95..1899.8= 6 rows=3D19243 width=3D142) (actual time=3D1.129..176.772 rows=3D57730 loop= s=3D1) Hash Cond: (a.attrelid =3D c.oi= d) Join Filter: has_column_privile= ge(c.oid, a.attnum, 'select'::text) -> Seq Scan on pg_attribute a = (cost=3D0.00..1575.30 rows=3D57730 width=3D70) (actual time=3D0.013..9.457= rows=3D57730 loops=3D1) Filter: (NOT attisdropped) -> Hash (cost=3D128.47..128.4= 7 rows=3D3558 width=3D72) (actual time=3D0.993..0.994 rows=3D3558 loops=3D1) Buckets: 4096 Batches: 1= Memory Usage: 394kB -> Seq Scan on pg_class = c (cost=3D0.00..128.47 rows=3D3558 width=3D72) (actual time=3D0.021..0.558= rows=3D3558 loops=3D1) Filter: ((NOT relro= wsecurity) OR (NOT row_security_active(oid))) -> Hash (cost=3D1.04..1.04 rows=3D4 width=3D68)= (actual time=3D0.023..0.024 rows=3D4 loops=3D1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on pg_namespace n (cost=3D0.0= 0..1.04 rows=3D4 width=3D68) (actual time=3D0.015..0.017 rows=3D4 loops=3D1) SubPlan 2 -> Aggregate (cost=3D53.88..53.90 rows=3D1 width=3D8) (actual time=3D= 0.174..0.175 rows=3D1 loops=3D625) -> Nested Loop (cost=3D0.86..53.87 rows=3D1 width=3D469) (actua= l time=3D0.058..0.172 rows=3D39 loops=3D625) Join Filter: (s_2.starelid =3D c_1.oid) -> Nested Loop (cost=3D0.57..53.42 rows=3D1 width=3D10) (= actual time=3D0.011..0.076 rows=3D71 loops=3D625) -> Nested Loop (cost=3D0.28..9.36 rows=3D1 width=3D= 4) (actual time=3D0.006..0.006 rows=3D1 loops=3D625) Join Filter: (c_1.relnamespace =3D n_1.oid) -> Index Scan using pg_class_relname_nsp_index= on pg_class c_1 (cost=3D0.28..8.30 rows=3D1 width=3D8) (actual time=3D0.0= 02..0.002 rows=3D1 loops=3D625) Index Cond: (relname =3D tbl.relname) Filter: ((NOT relrowsecurity) OR (NOT row= _security_active(oid))) -> Seq Scan on pg_namespace n_1 (cost=3D0.00.= =2E1.05 rows=3D1 width=3D4) (actual time=3D0.001..0.001 rows=3D1 loops=3D62= 5) Filter: (nspname =3D ns.nspname) Rows Removed by Filter: 3 -> Index Scan using pg_attribute_relid_attnum_index = on pg_attribute a_1 (cost=3D0.29..43.98 rows=3D7 width=3D6) (actual time= =3D0.005..0.064 rows=3D71 loops=3D625) Index Cond: (attrelid =3D c_1.oid) Filter: ((NOT attisdropped) AND has_column_priv= ilege(c_1.oid, attnum, 'select'::text)) -> Index Scan using pg_statistic_relid_att_inh_index on pg= _statistic s_2 (cost=3D0.29..0.44 rows=3D1 width=3D6) (actual time=3D0.001= =2E.0.001 rows=3D1 loops=3D44210) Index Cond: ((starelid =3D a_1.attrelid) AND (staattn= um =3D a_1.attnum)) Filter: (stanullfrac <> '0'::double precision) Rows Removed by Filter: 0 Planning Time: 9.752 ms JIT: Functions: 115 Options: Inlining false, Optimization false, Expressions true, Deforming = true Timing: Generation 6.386 ms (Deform 3.235 ms), Inlining 0.000 ms, Optimiz= ation 2.815 ms, Emission 53.702 ms, Total 62.903 ms Execution Time: 592.748 ms Subquery Scan on foo (cost=3D5370.24..142883.56 rows=3D2500 width=3D204) (= actual time=3D256.040..292421.584 rows=3D625 loops=3D1) -> GroupAggregate (cost=3D5370.24..142746.06 rows=3D2500 width=3D188) (= actual time=3D256.022..292420.807 rows=3D625 loops=3D1) Group Key: ns.nspname, tbl.relname InitPlan 1 -> Result (cost=3D0.00..0.02 rows=3D1 width=3D32) (actual time= =3D0.014..0.015 rows=3D1 loops=3D1) -> Incremental Sort (cost=3D5370.23..7745.68 rows=3D8357 width=3D= 136) (actual time=3D255.739..292322.422 rows=3D40460 loops=3D1) Sort Key: ns.nspname, tbl.relname Presorted Key: ns.nspname Full-sort Groups: 2 Sort Method: quicksort Average Memory: = 34kB Peak Memory: 34kB Pre-sorted Groups: 3 Sort Methods: quicksort, external merge= Average Memory: 38kB Peak Memory: 88kB Average Disk: 1874kB Peak Disk:= 5624kB -> Merge Left Join (cost=3D4613.25..7180.30 rows=3D8357 wid= th=3D136) (actual time=3D222.037..292242.701 rows=3D40460 loops=3D1) Merge Cond: (ns.nspname =3D s.schemaname) Join Filter: ((s.tablename =3D tbl.relname) AND (s.attn= ame =3D att.attname)) Rows Removed by Join Filter: 1215045276 -> Nested Loop (cost=3D136.42..2515.44 rows=3D8357 wi= dth=3D192) (actual time=3D65.105..188.427 rows=3D40460 loops=3D1) Join Filter: (tbl.relnamespace =3D ns.oid) Rows Removed by Join Filter: 121380 -> Index Scan using pg_namespace_nspname_index o= n pg_namespace ns (cost=3D0.13..12.19 rows=3D4 width=3D68) (actual time=3D= 0.021..0.039 rows=3D4 loops=3D1) -> Materialize (cost=3D136.29..2022.72 rows=3D8= 357 width=3D132) (actual time=3D16.114..36.529 rows=3D40460 loops=3D4) -> Hash Join (cost=3D136.29..1980.93 rows= =3D8357 width=3D132) (actual time=3D64.422..76.465 rows=3D40460 loops=3D1) Hash Cond: (att.attrelid =3D tbl.oid) -> Seq Scan on pg_attribute att (co= st=3D0.00..1719.62 rows=3D47572 width=3D68) (actual time=3D63.875..69.458 r= ows=3D47512 loops=3D1) Filter: (attnum > 0) Rows Removed by Filter: 10218 -> Hash (cost=3D128.47..128.47 rows= =3D625 width=3D72) (actual time=3D0.490..0.491 rows=3D625 loops=3D1) Buckets: 1024 Batches: 1 Memo= ry Usage: 72kB -> Seq Scan on pg_class tbl (= cost=3D0.00..128.47 rows=3D625 width=3D72) (actual time=3D0.038..0.410 rows= =3D625 loops=3D1) Filter: (relkind =3D 'r':= :"char") Rows Removed by Filter: 2= 933 -> Sort (cost=3D4476.84..4477.18 rows=3D139 width=3D2= 00) (actual time=3D156.864..178628.030 rows=3D1215036357 loops=3D1) Sort Key: s.schemaname Sort Method: external sort Disk: 6480kB -> Subquery Scan on s (cost=3D2189.59..4471.89 = rows=3D139 width=3D200) (actual time=3D81.840..131.044 rows=3D30921 loops= =3D1) -> Hash Left Join (cost=3D2189.59..4470.5= 0 rows=3D139 width=3D469) (actual time=3D81.835..128.788 rows=3D30921 loops= =3D1) Hash Cond: (c.relnamespace =3D n.oid) -> Hash Join (cost=3D2188.50..4468.= 65 rows=3D139 width=3D140) (actual time=3D81.787..123.140 rows=3D30921 loop= s=3D1) Hash Cond: ((s_1.starelid =3D c= =2Eoid) AND (s_1.staattnum =3D a.attnum)) -> Seq Scan on pg_statistic s_= 1 (cost=3D0.00..2008.21 rows=3D15460 width=3D14) (actual time=3D0.014..5.0= 69 rows=3D30921 loops=3D1) Filter: (NOT stainherit) -> Hash (cost=3D1899.86..1899= =2E86 rows=3D19243 width=3D142) (actual time=3D81.660..81.661 rows=3D57730 = loops=3D1) Buckets: 65536 (originall= y 32768) Batches: 2 (originally 1) Memory Usage: 7681kB -> Hash Join (cost=3D17= 2.95..1899.86 rows=3D19243 width=3D142) (actual time=3D1.119..59.096 rows= =3D57730 loops=3D1) Hash Cond: (a.attre= lid =3D c.oid) Join Filter: has_co= lumn_privilege(c.oid, a.attnum, 'select'::text) -> Seq Scan on pg_= attribute a (cost=3D0.00..1575.30 rows=3D57730 width=3D70) (actual time=3D= 0.010..7.991 rows=3D57730 loops=3D1) Filter: (NOT = attisdropped) -> Hash (cost=3D1= 28.47..128.47 rows=3D3558 width=3D72) (actual time=3D1.062..1.063 rows=3D35= 58 loops=3D1) Buckets: 4096= Batches: 1 Memory Usage: 394kB -> Seq Scan = on pg_class c (cost=3D0.00..128.47 rows=3D3558 width=3D72) (actual time=3D= 0.065..0.636 rows=3D3558 loops=3D1) Filter:= ((NOT relrowsecurity) OR (NOT row_security_active(oid))) -> Hash (cost=3D1.04..1.04 rows=3D4= width=3D68) (actual time=3D0.025..0.026 rows=3D4 loops=3D1) Buckets: 1024 Batches: 1 Memo= ry Usage: 9kB -> Seq Scan on pg_namespace n = (cost=3D0.00..1.04 rows=3D4 width=3D68) (actual time=3D0.016..0.018 rows= =3D4 loops=3D1) SubPlan 2 -> Aggregate (cost=3D53.88..53.90 rows=3D1 width=3D8) (actual t= ime=3D0.146..0.146 rows=3D1 loops=3D625) -> Nested Loop (cost=3D0.86..53.87 rows=3D1 width=3D469) = (actual time=3D0.048..0.144 rows=3D39 loops=3D625) Join Filter: (s_2.starelid =3D c_1.oid) -> Nested Loop (cost=3D0.57..53.42 rows=3D1 width= =3D10) (actual time=3D0.009..0.060 rows=3D71 loops=3D625) -> Nested Loop (cost=3D0.28..9.36 rows=3D1 wi= dth=3D4) (actual time=3D0.005..0.005 rows=3D1 loops=3D625) Join Filter: (c_1.relnamespace =3D n_1.oi= d) -> Index Scan using pg_class_relname_nsp= _index on pg_class c_1 (cost=3D0.28..8.30 rows=3D1 width=3D8) (actual time= =3D0.002..0.002 rows=3D1 loops=3D625) Index Cond: (relname =3D tbl.relnam= e) Filter: ((NOT relrowsecurity) OR (N= OT row_security_active(oid))) -> Seq Scan on pg_namespace n_1 (cost= =3D0.00..1.05 rows=3D1 width=3D4) (actual time=3D0.001..0.001 rows=3D1 loop= s=3D625) Filter: (nspname =3D ns.nspname) Rows Removed by Filter: 3 -> Index Scan using pg_attribute_relid_attnum_= index on pg_attribute a_1 (cost=3D0.29..43.98 rows=3D7 width=3D6) (actual = time=3D0.004..0.050 rows=3D71 loops=3D625) Index Cond: (attrelid =3D c_1.oid) Filter: ((NOT attisdropped) AND has_colum= n_privilege(c_1.oid, attnum, 'select'::text)) -> Index Scan using pg_statistic_relid_att_inh_index= on pg_statistic s_2 (cost=3D0.29..0.44 rows=3D1 width=3D6) (actual time= =3D0.001..0.001 rows=3D1 loops=3D44210) Index Cond: ((starelid =3D a_1.attrelid) AND (s= taattnum =3D a_1.attnum)) Filter: (stanullfrac <> '0'::double precision) Rows Removed by Filter: 0 Planning Time: 5.314 ms JIT: Functions: 110 Options: Inlining false, Optimization false, Expressions true, Deforming = true Timing: Generation 11.353 ms (Deform 5.059 ms), Inlining 0.000 ms, Optimi= zation 5.227 ms, Emission 58.982 ms, Total 75.563 ms Execution Time: 292436.103 ms What is happening here? The plans are clearly different, with the slow one claiming to sort 1.2 billion rows (but using only 6.5MB of disk space) despite the node below it only returning 30921 rows. 1215036357 is almost but not quite 30921*40460, but that nested loop is a sibling of the sort, so it shouldn't be included, right? And why are the plans different at all? Computing a few extra values per row shouldn't change the cost of the query delivering the rows, IMHO. But then the costs are very similar, so maybe it's just some random variation. Some estimates are quite a bit off, even on the tables. I did ANALYZE the whole database (and then the tables, again) during my tests. hjp --=20 _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!" --kt7fa7vrwjzf7qnh Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEETtJbRjyPwVTYGJ5k8g5IURL+KF0FAmh/q4gACgkQ8g5IURL+ KF0/3w/+OGyIzKTmaMSfHlnbKdSAJl4ysKBa+2YFXfsnKCtQcEcUYVrDNOOBMS+r UTXuRbuHIkMHA3CO9nPaK+VOT0lObPtPJjaDbAncTJrx8nax0q8WA9hdvRXrJk7E iI+16JrLXZOXNOJMtdRMmfWELaw/KwfzGNU33aiZfPBBlaUic/xnW9akSHJXIODY Uo3bGdwyn6ddxShAI/szUa6ebgT8ZOjyJ0WcbejUTm3fnzk/Nw2CdnDH839WxpG5 utSLYeEmBa+MmIHDL1oCSV3duGBbLcK3CqFIM4/GuyEkS25t9+tUKhv6YGJr5CiS D2LCkqSzZ1q9b7XaCwFRnPRZPwl8hiICukJb54oeauRFvzIWI2wTAjA4B3i3Q3s8 aVR3ukZZeOCNw7AbPXyUSUJpBFEKVTUYFP9qZo7/bHBwDYaZUFgS6aFlpBtDo7Hc kFV4S/WaMr8qWIDjnuag7DmzXGI+Eb8dPFKLc8x08n9ZL4nwDHxi9ZIOtEoMwP21 Ml9CphkHxiiUaoA+uTm9nMCQOI73LJ3vTeHwzYFbNd9LtOGjrLU2n420+gRlfDQE Aen7RgSfhN9wztssNCnc5v2aRyi7sESWS5o2i1hRgVC/0RAStSS99BMd++4+5Brj xWojr4EVBvxqAjg3scszOBlw3crKJK5Q90mcpNFefXoeKytWY7k= =EFR1 -----END PGP SIGNATURE----- --kt7fa7vrwjzf7qnh--