Re: Wasteful nested loop join when there is `limit` in the query

public inbox for [email protected]  
help / color / mirror / Atom feed

From: WU Yan <[email protected]>
To: Tom Lane <[email protected]>
Cc: [email protected]
Subject: Re: Wasteful nested loop join when there is `limit` in the query
Date: Tue, 18 Feb 2025 12:14:23 +1100
Message-ID: <CAAdwFAxRwMeamr2f88rjLZCWRHmBk=FeWPuuf6TiYqEw64_F9g@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <CAAdwFAwm6HwXM_cuPWZBxrxX4E7pBdVg=KcVDSP6q9ume3hYpQ@mail.gmail.com>
	<[email protected]>

Thank you for your help, Tom.

You are right. I added an index on employee.name (by making it unique), and
then postgres can visit employee table in a pre-sorted manner, and can exit
early without joining more rows.


Just sharing the tweak I did to the example, if anyone else is interested
in a quick test. I also populated 1 million rows so the example is no
longer a toy demo.

```sql
drop table if exists department;
drop table if exists employee;

create table department(
    id int primary key,
    name text);
create table employee(
    id int primary key,
    name text unique,
    department_id int);

INSERT INTO department (id, name)
SELECT i+1, 'department' || i+1
FROM generate_series(0, 9) AS i;

INSERT INTO employee (id, name, department_id)
SELECT i+1, 'name' || i+1, i % 10 +1
FROM generate_series(0, 999999) AS i;

analyze department;
analyze employee;

explain analyze
select *
from employee left outer join department
    on employee.department_id = department.id
order by employee.name limit 10;
```

And here is the plan:
```
                                                                     QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..1.36 rows=10 width=34) (actual time=0.017..0.030
rows=10 loops=1)
   ->  Nested Loop Left Join  (cost=0.57..78630.06 rows=1000000 width=34)
(actual time=0.016..0.028 rows=10 loops=1)
         ->  Index Scan using employee_name_key on employee
 (cost=0.42..54855.68 rows=1000000 width=18) (actual time=0.008..0.015
rows=10 loops=1)
         ->  Memoize  (cost=0.15..0.16 rows=1 width=16) (actual
time=0.001..0.001 rows=1 loops=10)
               Cache Key: employee.department_id
               Cache Mode: logical
               Hits: 6  Misses: 4  Evictions: 0  Overflows: 0  Memory
Usage: 1kB
               ->  Index Scan using department_pkey on department
 (cost=0.14..0.15 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops=4)
                     Index Cond: (id = employee.department_id)
 Planning Time: 0.189 ms
 Execution Time: 0.045 ms
(11 rows)
```

Personally I still wish someday postgres can push down `limit` node
together with `sort` node when certain conditions are met, so that there's
no need to add an index :D

Thank you again for your help!

On Mon, 17 Feb 2025 at 18:01, Tom Lane <[email protected]> wrote:

> WU Yan <[email protected]> writes:
> > Hello everyone, I am still learning postgres planner and performance
> > optimization, so please kindly point out if I missed something obvious.
>
> An index on employee.name would likely help here.  Even if we had
> an optimization for pushing LIMIT down through a join (which you
> are right, we don't) it could not push the LIMIT through a sort step.
> So you need presorted output from the scan of "employee".  I think
> this example would behave better with that.  You may also need to
> test with non-toy amounts of data to get the plan you think is
> better: an example with only half a dozen rows is going to be
> swamped by startup costs.
>
>                         regards, tom lane
>

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Wasteful nested loop join when there is `limit` in the query
  In-Reply-To: <CAAdwFAxRwMeamr2f88rjLZCWRHmBk=FeWPuuf6TiYqEw64_F9g@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox