Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tjumd-00Ed3J-A7 for pgsql-general@arkaria.postgresql.org; Mon, 17 Feb 2025 06:39:11 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tjumb-00Df2U-90 for pgsql-general@arkaria.postgresql.org; Mon, 17 Feb 2025 06:39:09 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from <4wuyan@gmail.com>) id 1tjuma-00Df1u-Ss for pgsql-general@lists.postgresql.org; Mon, 17 Feb 2025 06:39:08 +0000 Received: from mail-wm1-x335.google.com ([2a00:1450:4864:20::335]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from <4wuyan@gmail.com>) id 1tjumW-001Hu0-0b for pgsql-general@lists.postgresql.org; Mon, 17 Feb 2025 06:39:08 +0000 Received: by mail-wm1-x335.google.com with SMTP id 5b1f17b1804b1-43984e9cc90so2434885e9.1 for ; Sun, 16 Feb 2025 22:39:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1739774342; x=1740379142; darn=lists.postgresql.org; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=R5Smx/ZYdAnsubQne/DL5Ew2pJc6VdRGD1Mgdlb4IE0=; b=Edku15+js5l8kAqsQ48ikGLaZLocH5Hg/jR/2c7ea0pD2XLOrGtznY2PxqOrsn1moz v3IBONZMjEF12GaGA5o6PBfqxY8oLqEglus/EpUawbmrktTI38Qsjg9TiQZK5xWDAOGn dUoorerg5R9hHKfJlQMPChU1d/owUhRDOW9vY4cacFLuEtDwYPK0avkKhZHgeTcmzyQ4 ICiyK0d1TqCVvNZFTv8PSpRjMQWLRqlk+O9Qp2hnASGuy2DjaUaCLdWA5YTYOL7JWul2 hmWyQX260QE5yaqkNBUrTMWt/eV0FdNFNEn2/vcO6fkXaugNtyPXmmJj+3QXmBIo2ztL jEkw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1739774342; x=1740379142; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=R5Smx/ZYdAnsubQne/DL5Ew2pJc6VdRGD1Mgdlb4IE0=; b=Tmu5aCXevmQSQKy0y2tKbpUBF3k8RzF3h33IxWg6acsUPMxAcoSQ5vMNc4jEE7uXUA tACMU03eIKxPSXWWVfL0U6g0OSpKxVaylVrv1l8trCp2QFW58OOqtw++qx6gf51h3nwC MwnODYOyr0APdBtbSYHNw6lb08MGOfmqx2scPX6nGvlptk6vrzVxzVHXQsYKvU4NcWvn Zg2H7L4gMOACa+UwDvTwh5NEoxiCw8rbES5pEVeh4MfsFl77JYS0R3hxcYSG3C8sCWqi iHsfDuO3q+Rc7o0dZ+qNwhMnlwj3RAP6TSltBYqqG4nS0wWfKgshw7oOyTHd1iwh1PsZ CjPA== X-Gm-Message-State: AOJu0Yw6dz8eEFDrysa9PLjaDu1amr036z8Q6QUhyD1N8NdA+zzTXeIj OffPjagUtoH8y3JqxfOwew08Fi5LJn9MMfuZ+EHHXVtf/4MTULQFCN7vMaGhG3h+KViGxHFjzTq fcqECxXj2cva4/mx+vJ8K1PW12twjGI4M X-Gm-Gg: ASbGncv0K15hu+PBGxwT9iJW1r1Fnqw1th41eEZEkE8Z43QAwM7YPWU04/2qqYEBhkZ bjl7Zxf8hfQnOvsE72Fll0fiAOW+Enq6jo83pT+INvelYjjjTof5H19rAU4dFmk6F3fKjbK+AEA == X-Google-Smtp-Source: AGHT+IHF3ca36zI3sNEb2GUMWhlHgqi3ybNNrCeYmrbEcKmn0Mj74OF/J9hI2k9YTKtekZ7SXRMUGsCmhAgjtXLJBvM= X-Received: by 2002:a05:6000:41e4:b0:38f:232e:dd5e with SMTP id ffacd0b85a97d-38f24d7417emr19469762f8f.22.1739774340970; Sun, 16 Feb 2025 22:39:00 -0800 (PST) MIME-Version: 1.0 From: WU Yan <4wuyan@gmail.com> Date: Mon, 17 Feb 2025 17:38:49 +1100 X-Gm-Features: AWEUYZk93JVHDExxAV5t2cUa9VAjjJpLADRu7cXHJOtIDZGZ9h4hxa6Fg77Q9kw Message-ID: Subject: Wasteful nested loop join when there is `limit` in the query To: pgsql-general@lists.postgresql.org Content-Type: multipart/alternative; boundary="000000000000953f0c062e50c745" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000953f0c062e50c745 Content-Type: text/plain; charset="UTF-8" Hello everyone, I am still learning postgres planner and performance optimization, so please kindly point out if I missed something obvious. I've noticed that postgres joins all rows in two tables, even though there's a `limit` in the query. It means lots of joined rows are discarded eventually, and the performance can be better if postgres planner can do `limit` before the `join`. Here's my example. I am running postgres 17.3 in a container ```sh docker run --rm -e POSTGRES_PASSWORD=mysecretpassword -d -p 5432:5432 postgres:17.3 ``` ```sql create table department( id int primary key, name text); create table employee( id int primary key, name text, department_id int); insert into department values (1, 'sales'), (2, 'support'), (3, 'research'), (4, 'management'), (5, 'hr'); insert into employee values (101, 'alice', 1), (102, 'bob', 2), (103, 'ccc', 3), (104, 'ddd', 4), (105, 'eee', 999); analyze department; analyze employee; set enable_mergejoin = off; set enable_hashjoin = off; explain analyze select * from employee left outer join department on employee.department_id = department.id order by employee.name limit 2; ``` In this simple setup, I want to see the top 2 employees by name, and check their department info. Here's the query plan: ``` QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Limit (cost=2.49..2.49 rows=2 width=23) (actual time=0.037..0.038 rows=2 loops=1) -> Sort (cost=2.49..2.50 rows=5 width=23) (actual time=0.036..0.037 rows=2 loops=1) Sort Key: employee.name Sort Method: top-N heapsort Memory: 25kB -> Nested Loop Left Join (cost=0.00..2.44 rows=5 width=23) (actual time=0.014..0.021 rows=5 loops=1) Join Filter: (employee.department_id = department.id) Rows Removed by Join Filter: 11 -> Seq Scan on employee (cost=0.00..1.05 rows=5 width=12) (actual time=0.006..0.007 rows=5 loops=1) -> Materialize (cost=0.00..1.07 rows=5 width=11) (actual time=0.001..0.001 rows=3 loops=5) -> Seq Scan on department (cost=0.00..1.05 rows=5 width=11) (actual time=0.001..0.002 rows=5 loops=1) Planning Time: 0.248 ms Execution Time: 0.062 ms (12 rows) ``` Note `loops=5` in the plan. My understanding is, all 5 rows in table employee are joined, although we only need 2. I think join before limit is necessary if we do not know whether join will create multiple rows or eliminate rows. However, in this setup, it's left outer join on a primary key, so we know one employee will produce exactly one row. Similar assumption can be made for `a inner join b on a.x = b.x` when a.x is a foreign key referencing b.x and b.x has a unique constraint. At the moment, my workaround is to rewrite the query with CTE. I need to write `order by ... limit ...` twice. `loops=2` is observed. ``` postgres=# explain analyze with t as ( select * from employee order by employee.name limit 2 ) select * from t left outer join department on t.department_id = department.id order by t.name limit 2; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Limit (cost=1.10..2.32 rows=2 width=23) (actual time=0.045..0.049 rows=2 loops=1) -> Nested Loop Left Join (cost=1.10..2.32 rows=2 width=23) (actual time=0.044..0.047 rows=2 loops=1) Join Filter: (employee.department_id = department.id) Rows Removed by Join Filter: 1 -> Limit (cost=1.10..1.10 rows=2 width=12) (actual time=0.032..0.033 rows=2 loops=1) -> Sort (cost=1.10..1.11 rows=5 width=12) (actual time=0.032..0.033 rows=2 loops=1) Sort Key: employee.name Sort Method: top-N heapsort Memory: 25kB -> Seq Scan on employee (cost=0.00..1.05 rows=5 width=12) (actual time=0.015..0.016 rows=5 loops=1) -> Materialize (cost=0.00..1.07 rows=5 width=11) (actual time=0.004..0.004 rows=2 loops=2) -> Seq Scan on department (cost=0.00..1.05 rows=5 width=11) (actual time=0.004..0.005 rows=2 loops=1) Planning Time: 0.196 ms Execution Time: 0.103 ms (13 rows) ``` I wonder whether there's a better way to influence the planner to push the `limit` down to the `employee` table first, or this optimization strategy is not yet in the planner. The demo is trivial, but in some real cases, it's possible to select top 100 from 1 million rows. Sorting 1 million rows is slow but necessary and acceptable, but wasteful nested loop join on millions of rows can make the performance much slower. Even worse, the wasteful join can happen multiple times when it's more than 2 tables joining. Thank you Best regards Yan --000000000000953f0c062e50c745 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello everyone, I am still learning postgres planner and p= erformance optimization, so please kindly point out if I missed something o= bvious.

I've noticed that postgres joins all rows in two tables,= even though there's a `limit` in the query. It means lots of joined ro= ws are discarded eventually, and the performance can be better if postgres = planner can do `limit` before the `join`.

Here's my example. I a= m running postgres 17.3 in a container
```sh
docker run --rm -e POSTG= RES_PASSWORD=3Dmysecretpassword -d -p 5432:5432 postgres:17.3
```
```sql
create table department(
=C2=A0 =C2=A0 id int primary key,=C2=A0 =C2=A0 name text);
create table employee(
=C2=A0 =C2=A0 id in= t primary key,
=C2=A0 =C2=A0 name text,
=C2=A0 =C2=A0 department_id i= nt);
insert into department values
=C2=A0 =C2=A0 (1, 'sales')= ,
=C2=A0 =C2=A0 (2, 'support'),
=C2=A0 =C2=A0 (3, 'resear= ch'),
=C2=A0 =C2=A0 (4, 'management'),
=C2=A0 =C2=A0 (5, = 'hr');
insert into employee values
=C2=A0 =C2=A0 (101, 'a= lice', 1),
=C2=A0 =C2=A0 (102, 'bob', 2),
=C2=A0 =C2=A0 (= 103, 'ccc', 3),
=C2=A0 =C2=A0 (104, 'ddd', 4),
=C2=A0= =C2=A0 (105, 'eee', 999);
analyze department;
analyze employ= ee;

set enable_mergejoin =3D off;
set enable_hashjoin =3D off;
explain analyze
select *
from employee left outer join departmen= t
=C2=A0 =C2=A0 on employee.department_id =3D department.id
order by emplo= yee.name limit 2;
```

In this simple setup, I want to see the= top 2 employees by name, and check their department info. Here's the q= uery plan:
```
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0QUERY PLAN
----------------------------------------------------------= -------------------------------------------------------------------
=C2= =A0Limit =C2=A0(cost=3D2.49..2.49 rows=3D2 width=3D23) (actual time=3D0.037= ..0.038 rows=3D2 loops=3D1)
=C2=A0 =C2=A0-> =C2=A0Sort =C2=A0(cost=3D= 2.49..2.50 rows=3D5 width=3D23) (actual time=3D0.036..0.037 rows=3D2 loops= =3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: employee.name
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Meth= od: top-N heapsort =C2=A0Memory: 25kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= -> =C2=A0Nested Loop Left Join =C2=A0(cost=3D0.00..2.44 rows=3D5 width= =3D23) (actual time=3D0.014..0.021 rows=3D5 loops=3D1)
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Join Filter: (employee.department_id = =3D department.id)
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Rows Removed by Join Filter: 11=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Seq Sca= n on employee =C2=A0(cost=3D0.00..1.05 rows=3D5 width=3D12) (actual time=3D= 0.006..0.007 rows=3D5 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0-> =C2=A0Materialize =C2=A0(cost=3D0.00..1.07 rows=3D5 = width=3D11) (actual time=3D0.001..0.001 rows=3D3 loops=3D5)
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2= =A0Seq Scan on department =C2=A0(cost=3D0.00..1.05 rows=3D5 width=3D11) (ac= tual time=3D0.001..0.002 rows=3D5 loops=3D1)
=C2=A0Planning Time: 0.248 = ms
=C2=A0Execution Time: 0.062 ms
(12 rows)
```

Note `loops= =3D5` in the plan. My understanding is, all 5 rows in table employee are jo= ined, although we only need 2. I think join before limit is necessary if we= do not know whether join will create multiple rows or eliminate rows. Howe= ver, in this setup, it's left outer join on a primary key, so we know o= ne employee will produce exactly one row. Similar assumption can be made fo= r `a inner join b on a.x =3D b.x` when a.x is a foreign key referencing b.x= and b.x has a unique constraint.


At the moment, my workaround i= s to rewrite the query with CTE. I need to write `order by ... limit ...` t= wice. `loops=3D2` is observed.
```
postgres=3D# explain analyze
wi= th t as (
=C2=A0 =C2=A0 select * from employee order by employee.name limit 2
) select *
from t left out= er join department on t.department_id =3D = department.id
order by t.name limit 2;=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 QUERY PLAN
--------= ---------------------------------------------------------------------------= ----------------------------------------
=C2=A0Limit =C2=A0(cost=3D1.10.= .2.32 rows=3D2 width=3D23) (actual time=3D0.045..0.049 rows=3D2 loops=3D1)<= br>=C2=A0 =C2=A0-> =C2=A0Nested Loop Left Join =C2=A0(cost=3D1.10..2.32 = rows=3D2 width=3D23) (actual time=3D0.044..0.047 rows=3D2 loops=3D1)
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Join Filter: (employee.department_id =3D department.id)
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0Rows Removed by Join Filter: 1
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0-> =C2=A0Limit =C2=A0(cost=3D1.10..1.10 rows=3D2 width=3D12) (actu= al time=3D0.032..0.033 rows=3D2 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Sort =C2=A0(cost=3D1.10..1.11 rows= =3D5 width=3D12) (actual time=3D0.032..0.033 rows=3D2 loops=3D1)
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort K= ey: employee.name
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: top= -N heapsort =C2=A0Memory: 25kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Seq Scan on employee =C2=A0(= cost=3D0.00..1.05 rows=3D5 width=3D12) (actual time=3D0.015..0.016 rows=3D5= loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Materialize = =C2=A0(cost=3D0.00..1.07 rows=3D5 width=3D11) (actual time=3D0.004..0.004 r= ows=3D2 loops=3D2)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0-> =C2=A0Seq Scan on department =C2=A0(cost=3D0.00..1.05 rows=3D5 wid= th=3D11) (actual time=3D0.004..0.005 rows=3D2 loops=3D1)
=C2=A0Planning = Time: 0.196 ms
=C2=A0Execution Time: 0.103 ms
(13 rows)
```
I wonder whether there's a better way to influence the planner to push= the `limit` down to the `employee` table first, or this optimization strat= egy is not yet in the planner. The demo is trivial, but in some real cases,= it's possible to select top 100 from 1 million rows. Sorting 1 million= rows is slow but necessary and acceptable, but wasteful nested loop join o= n millions of rows can make the performance much slower. Even worse, the wa= steful join can happen multiple times when it's more than 2 tables join= ing.

Thank you

Best regards
Yan
--000000000000953f0c062e50c745--