Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sBrlE-001mG1-8t for pgsql-general@arkaria.postgresql.org; Tue, 28 May 2024 08:00:46 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sBrlE-0091wM-6t for pgsql-general@arkaria.postgresql.org; Tue, 28 May 2024 08:00:44 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sBrlC-0091wE-Uv for pgsql-general@lists.postgresql.org; Tue, 28 May 2024 08:00:43 +0000 Received: from fhigh4-smtp.messagingengine.com ([103.168.172.155]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sBrl8-00194R-5a for pgsql-general@postgresql.org; Tue, 28 May 2024 08:00:42 +0000 Received: from compute5.internal (compute5.nyi.internal [10.202.2.45]) by mailfhigh.nyi.internal (Postfix) with ESMTP id 163811140142 for ; Tue, 28 May 2024 04:00:36 -0400 (EDT) Received: from mailfrontend1 ([10.202.2.162]) by compute5.internal (MEProxy); Tue, 28 May 2024 04:00:36 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=purefiction.net; h=cc:content-transfer-encoding:content-type:content-type:date :date:from:from:in-reply-to:message-id:mime-version:reply-to :subject:subject:to:to; s=fm3; t=1716883236; x=1716969636; bh=UG uaIGM5xBxQwg8UupUAE9v1vC0031CWzz5hLxDDFqw=; b=rM3K5Vnc8uyol3ESx7 v6PzgizsoXNdMrY1dEF30k0bD0TfVeDkIK28jIJ1SrnX5LTN5PnVSEIC80BPIyJn BcBNqil6ywhr0jAIbcUNJYGnSdVn+vc3Ih3IvLgULThyQ2p29hGw9WHGXnXOu8ta RcpyOysim+jpGhMbuUPTkZinr3sel6ZLB3vu8Vf281z6xEAxLH7ISYVUcczuBGVT gas2wQcwDEzw1rPuI98Vc+IDf1RKfNEF4IPy8Ab2hLSQ4u5TpnbxO2QoHqlo83H1 xSZW0X5KBPhX9tTIsmpJyvPB3ALCk7q8BYWyS5ZKIpm6uzJ/v3BzQZeoiPs14oOR +EyQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :content-type:date:date:feedback-id:feedback-id:from:from :in-reply-to:message-id:mime-version:reply-to:subject:subject:to :to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s= fm1; t=1716883236; x=1716969636; bh=UGuaIGM5xBxQwg8UupUAE9v1vC00 31CWzz5hLxDDFqw=; b=cmxudshlhPFjnSt5pT9++rr6XKPpWhHx0sbvehrKiBMB G84reWxYeQqkwOgVW2K/rvNojpIbgOABROOeEMF4c5acHKib3/FVaSeZZLQjxTbk HlSDu6jyRq8J3fdHKM4brt3aHR5gK4prEQ0G0WQB+k4CCNmQNG/+Hp7oBGROEk28 2VVRRBDHuwYmr+m7LQ63vemtqBYOF20A9YF89pg2sB87io9LY+KyBrjRKQdCM/Vt O3AUgSELbTN8x5vvYzO1wFVBxHzX9Wdse9ZO4DxEHxdMecoebJkoCgxxXCKt/0eM EwPBywTW/ADFtYbI1YhawS/4xJ/UgPuceVEBh8b9aw== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvledrvdejhedguddvhecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfgh necuuegrihhlohhuthemuceftddtnecunecujfgurhephfgtgfgguffkfffvofesthhqmh dthhdtjeenucfhrhhomheptehlvgigrghnuggvrhcuufhtrghusghouceorghlvgigsehp uhhrvghfihgtthhiohhnrdhnvghtqeenucggtffrrghtthgvrhhnpeevteekhfduueeitd ehheehhffhfeehgeeufeduffetfeffudejieekhfffhfduhfenucffohhmrghinhepghhi thhhuhgsrdgtohhmnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilh hfrhhomheprghlvgigsehpuhhrvghfihgtthhiohhnrdhnvght X-ME-Proxy: Feedback-ID: i341740b3:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA for ; Tue, 28 May 2024 04:00:35 -0400 (EDT) From: Alexander Staubo Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.500.171.1.1\)) Subject: Use of inefficient index in the presence of dead tuples Message-Id: Date: Tue, 28 May 2024 10:00:22 +0200 To: "pgsql-general@postgresql.org" X-Mailer: Apple Mail (2.3774.500.171.1.1) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk I am encountering an odd problem where Postgres will use the wrong = index, particularly if the table has some dead tuples. The database affected is running 12.6, but I can = also reproduce with 16.3. To reproduce: (1) Disable autovacuum. This is just so we can induce a scenario where = there are lots of dead tuples. (2) Set up schema. It's important to create the index before insertion, = in order to provoke a situation where the indexes have dead tuples: CREATE TABLE outbox_batches ( id text NOT NULL, receiver text NOT NULL, created_at timestamp without time zone DEFAULT now() NOT = NULL, PRIMARY KEY (receiver, id) ); CREATE INDEX outbox_batches_on_receiver_and_created_at ON outbox_batches (receiver, created_at DESC); (3) Insert 5M rows of dummy data. Note that we are using UUIDs here for = the purposes of testing; in my real database, I use much shorter unique IDs. INSERT INTO outbox_batches (receiver, id) SELECT 'dummy', uuid_generate_v4() FROM (SELECT * FROM generate_series(1, 5000000, 1)) AS foo; (4) Then ensure all tuples are dead except one: DELETE FROM outbox_batches; INSERT INTO outbox_batches (receiver, id) VALUES ('dummy', 'test'); (5) Analyze: ANALYZE outbox_batches; (6) You should now have 5m dead rows and 1 live row: SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = =3D 'outbox_batches'; = =E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=90 =E2=94=82 n_live_tup =E2=94=82 n_dead_tup =E2=94=82 = =E2=94=9C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=BC=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=A4 =E2=94=82 1 =E2=94=82 5000000 =E2=94=82 = =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=B4=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=98 We also observe that the outbox_batches_pkey index is 454 MB, and = the outbox_batches_on_receiver_and_created_at is 31 MB. (7) Try the following query: EXPLAIN (ANALYZE, VERBOSE, BUFFERS, COSTS, TIMING, SETTINGS, = SUMMARY) SELECT id FROM outbox_batches WHERE receiver =3D 'dummy' AND id =3D 'test'; Here's the query plan: Index Scan using outbox_batches_on_receiver_and_created_at on = public.outbox_batches (cost=3D0.38..8.39 rows=3D1 width=3D5) (actual = time=3D0.426..984.038 rows=3D1 loops=3D1) Output: id Index Cond: (outbox_batches.receiver =3D 'dummy'::text) Filter: (outbox_batches.id =3D 'test'::text) Buffers: shared hit=3D3948 read=3D60742 dirtied=3D60741 = written=3D30209 Settings: work_mem =3D '32MB' Query Identifier: -2232653838283363139 Planning: Buffers: shared hit=3D18 read=3D3 Planning Time: 1.599 ms Execution Time: 984.082 ms This query is reading 60K buffers even though it only needs to read a = single row. Notice in particular the use of the index outbox_batches_on_receiver_and_created_at, even though = outbox_batches_pkey would be a much better choice. We know this because if we drop the first index: Index Only Scan using outbox_batches_pkey on public.outbox_batches = (cost=3D0.50..8.52 rows=3D1 width=3D5) (actual time=3D2.067..2.070 = rows=3D1 loops=3D1) Output: id Index Cond: ((outbox_batches.receiver =3D 'dummy'::text) AND = (outbox_batches.id =3D 'test'::text)) Heap Fetches: 1 Buffers: shared hit=3D1 read=3D4 Settings: work_mem =3D '32MB' Query Identifier: -2232653838283363139 Planning: Buffers: shared hit=3D5 dirtied=3D1 Planning Time: 0.354 ms Execution Time: 2.115 ms This is also the index that's used in the normal case when there are no = dead tuples at all. Interestingly, the cost of an index only scan on outbox_batches_pkey is = 8.52, whereas the other is 8.39. Is this because it considers the number of index pages? I've tried = adjusting the various cost and memory settings, but they have no effect. In this test, we created 5M dead tuples. However, for me it also = reproduces with just 1,000 rows. For such a small table, the performance degradation is minimal, but it = increases as more and more tuples are deleted. In a production environment, we have rows being constantly deleted at a = high rate, leaving a table that often has very few live tuples, and often 500K+ dead tuples before = autovacuum can kick in. Here I am consistently seeing the wrong index used, leading to poor = performance. The autovacuum settings ar aggressive, but for whatever reason it is not = keeping up. We also have long-running transactions that sometimes cause the xmin to hang back for = a while, preventing vacuums from helping. All of that said, I would rather Postgres choose the right index than = spend a lot of time optimizing vacuums. Here's my full server config: = https://gist.github.com/atombender/54207d473e415fab26fc59751a22feca.