Feedback-ID: i341740b3:Fastmail
From: Alexander Staubo <alex@purefiction.net>
Content-Type: text/plain;
	charset=utf-8
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.500.171.1.1\))
Subject: Use of inefficient index in the presence of dead tuples
Message-Id: <DC43B9C3-7BCB-4671-A69E-B0061C710241@purefiction.net>
Date: Tue, 28 May 2024 10:00:22 +0200
To: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Archived-At: <https://www.postgresql.org/message-id/DC43B9C3-7BCB-4671-A69E-B0061C710241%40purefiction.net>
Precedence: bulk

I am encountering an odd problem where Postgres will use the wrong =
index, particularly if the table
has some dead tuples. The database affected is running 12.6, but I can =
also reproduce with 16.3.

To reproduce:

(1) Disable autovacuum. This is just so we can induce a scenario where =
there are lots of dead tuples.

(2) Set up schema. It's important to create the index before insertion, =
in order to provoke a
situation where the indexes have dead tuples:

    CREATE TABLE outbox_batches (
        id             text    NOT NULL,
        receiver       text    NOT NULL,
        created_at     timestamp without time zone DEFAULT now() NOT =
NULL,
        PRIMARY KEY (receiver, id)
    );
    CREATE INDEX outbox_batches_on_receiver_and_created_at
        ON outbox_batches (receiver, created_at DESC);

(3) Insert 5M rows of dummy data. Note that we are using UUIDs here for =
the purposes of testing; in
    my real database, I use much shorter unique IDs.

    INSERT INTO outbox_batches (receiver, id)
        SELECT 'dummy', uuid_generate_v4()
        FROM (SELECT * FROM generate_series(1, 5000000, 1)) AS foo;

(4) Then ensure all tuples are dead except one:

    DELETE FROM outbox_batches;
    INSERT INTO outbox_batches (receiver, id) VALUES ('dummy', 'test');

(5) Analyze:

    ANALYZE outbox_batches;

(6) You should now have 5m dead rows and 1 live row:

    SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname =
=3D 'outbox_batches';
    =
=E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=90
    =E2=94=82 n_live_tup =E2=94=82 n_dead_tup =E2=94=82
    =
=E2=94=9C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=BC=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=A4
    =E2=94=82          1 =E2=94=82    5000000 =E2=94=82
    =
=E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=B4=E2=94=80=E2=94=80=E2=94=
=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=98

    We also observe that the outbox_batches_pkey index is 454 MB, and =
the
    outbox_batches_on_receiver_and_created_at is 31 MB.

(7) Try the following query:

    EXPLAIN (ANALYZE, VERBOSE, BUFFERS, COSTS, TIMING, SETTINGS, =
SUMMARY)
    SELECT id FROM outbox_batches
    WHERE receiver =3D 'dummy'
    AND id =3D 'test';

Here's the query plan:

    Index Scan using outbox_batches_on_receiver_and_created_at on =
public.outbox_batches  (cost=3D0.38..8.39 rows=3D1 width=3D5) (actual =
time=3D0.426..984.038 rows=3D1 loops=3D1)
    Output: id
    Index Cond: (outbox_batches.receiver =3D 'dummy'::text)
    Filter: (outbox_batches.id =3D 'test'::text)
    Buffers: shared hit=3D3948 read=3D60742 dirtied=3D60741 =
written=3D30209
    Settings: work_mem =3D '32MB'
    Query Identifier: -2232653838283363139
    Planning:
    Buffers: shared hit=3D18 read=3D3
    Planning Time: 1.599 ms
    Execution Time: 984.082 ms

This query is reading 60K buffers even though it only needs to read a =
single row. Notice in particular the
use of the index outbox_batches_on_receiver_and_created_at, even though =
outbox_batches_pkey would be
a much better choice. We know this because if we drop the first index:

    Index Only Scan using outbox_batches_pkey on public.outbox_batches  =
(cost=3D0.50..8.52 rows=3D1 width=3D5) (actual time=3D2.067..2.070 =
rows=3D1 loops=3D1)
    Output: id
    Index Cond: ((outbox_batches.receiver =3D 'dummy'::text) AND =
(outbox_batches.id =3D 'test'::text))
    Heap Fetches: 1
    Buffers: shared hit=3D1 read=3D4
    Settings: work_mem =3D '32MB'
    Query Identifier: -2232653838283363139
    Planning:
    Buffers: shared hit=3D5 dirtied=3D1
    Planning Time: 0.354 ms
    Execution Time: 2.115 ms

This is also the index that's used in the normal case when there are no =
dead tuples at all.

Interestingly, the cost of an index only scan on outbox_batches_pkey is =
8.52, whereas the other is
8.39. Is this because it considers the number of index pages? I've tried =
adjusting the various cost
and memory settings, but they have no effect.

In this test, we created 5M dead tuples. However, for me it also =
reproduces with just 1,000 rows.
For such a small table, the performance degradation is minimal, but it =
increases as more and more
tuples are deleted.

In a production environment, we have rows being constantly deleted at a =
high rate, leaving a table
that often has very few live tuples, and often 500K+ dead tuples before =
autovacuum can kick in. Here
I am consistently seeing the wrong index used, leading to poor =
performance.

The autovacuum settings ar aggressive, but for whatever reason it is not =
keeping up. We also have
long-running transactions that sometimes cause the xmin to hang back for =
a while, preventing
vacuums from helping.

All of that said, I would rather Postgres choose the right index than =
spend a lot of time optimizing
vacuums.

Here's my full server config: =
https://gist.github.com/atombender/54207d473e415fab26fc59751a22feca.