Optimize CPU usage of dropping buffers during recovery

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Jingtang Zhang <[email protected]>
To: [email protected]
To: PostgreSQL-development <[email protected]>
Subject: Optimize CPU usage of dropping buffers during recovery
Date: Wed, 28 Jan 2026 01:41:52 +0800
Message-ID: <[email protected]> (raw)

Hi hackers,

I happen to find a case where a transaction involves dropping a relation
file, and a relation happens to be empty, the WAL recovery will waste
a lot of time on scanning the whole buffer pool.

In DropRelationBuffers/DropRelationsAllBuffers, we will take fast path
to avoid scanning whole buffer pool, if the buffers to be invalidated
are less than typically 1/32 of buffer pool. However, since no more WAL
records are referencing an empty relation, if we want to drop it, startup
process will never initialize smgr_cached_nblocks for that relation,
causing fallback to slow path.

The situation can be like refreshing a materialized view whose query
returns no row, or truncating a table and then loading it but the query
returns no row. It will generate WAL sequence like:

rmgr: Standby     len (rec/tot):     42/    42, tx:        780, lsn: 0/41238520, prev 0/41238490, desc: LOCK xid 780 db 16384 rel 16416
rmgr: Storage     len (rec/tot):     42/    42, tx:        780, lsn: 0/412385F0, prev 0/412385C0, desc: CREATE base/16384/16416
...
rmgr: Transaction len (rec/tot):   1073/  1073, tx:        783, lsn: 0/4136BF70, prev 0/41369F60, desc: COMMIT 2026-01-27 08:07:49.953301 UTC; rels: base/16384/16416

This makes me think about initializing the cached nblocks to zero when
redo CREATE record. Note that the 0-size may not correct, because the
relation may have been zeroextended before server crash, and recovery
starts from an earlier checkpoint. But we can correct that by invalidating
the cached value when we really meet a WAL record referencing a block of
that relation, so that a later smgrnblocks() will fill-in the correct
file length. It can avoid extending the relation in the middle of a file
by mistake.

Quickly proposed a patch for it, with a Perl program for judging the
performance difference.

In my environment with 16GB shared buffers, -O2 optimized, debugging off,
I've run a test of 10 client * 500 transactions using such workload, made
the server crash and poll it until recovery is done. The CPU time:

w/o patch: CPU user: 77.58 s, system: 0.27 s
patched:   CPU user: 0.14 s,  system: 0.09 s

Without the patch, the CPU is wasted on BufTagMatchesRelFileLocator:

  Overhead  Command   Shared Object       Symbol
-   98.23%  postgres  postgres            [.] DropRelationsAllBuffers
     98.21% DropRelationsAllBuffers
        smgrdounlinkall
        DropRelationFiles
        xact_redo_commit
        xact_redo
        PerformWalRecovery
        StartupXLOG
        StartupProcessMain
        postmaster_child_launch
        StartChildProcess
        PostmasterStateMachine
        ServerLoop.isra.0
     0.06%  postgres  postgres            [.] hash_search_with_hash_value

Not sure if there will be any problem of correctness. Also, not sure about
what to do about INIT fork.

Any thoughts?

--
Regards, Jingtang




Attachments:

  [application/octet-stream] 0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch (4.2K, 2-0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch)
  download | inline diff:
From 2d8caed717a4ee68e045cb334376859049518789 Mon Sep 17 00:00:00 2001
From: Jingtang Zhang <[email protected]>
Date: Wed, 28 Jan 2026 01:36:21 +0800
Subject: [PATCH] Optimize CPU usage of dropping buffers during recovery

Initialize cached nblocks to 0 when redo CREATE record.
---
 src/backend/access/transam/xlogutils.c    |  2 +
 src/backend/catalog/storage.c             |  8 +++
 src/test/recovery/t/060_truncate_empty.pl | 73 +++++++++++++++++++++++
 3 files changed, 83 insertions(+)
 create mode 100644 src/test/recovery/t/060_truncate_empty.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5fbe39133b8..90627683935 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -489,6 +489,8 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 	 */
 	smgrcreate(smgr, forknum, true);
 
+	if (smgr->smgr_cached_nblocks[forknum] == 0)
+		smgr->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	lastblock = smgrnblocks(smgr, forknum);
 
 	if (blkno < lastblock)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e443a4993c5..6e35b2bfbdb 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -993,6 +993,14 @@ smgr_redo(XLogReaderState *record)
 
 		reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
 		smgrcreate(reln, xlrec->forkNum, true);
+
+		if (xlrec->forkNum == MAIN_FORKNUM)
+		{
+			reln->smgr_cached_nblocks[MAIN_FORKNUM] = 0;
+			reln->smgr_cached_nblocks[FSM_FORKNUM] = 0;
+			reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
+			// reln->smgr_cached_nblocks[INIT_FORKNUM] = 0;
+		}
 	}
 	else if (info == XLOG_SMGR_TRUNCATE)
 	{
diff --git a/src/test/recovery/t/060_truncate_empty.pl b/src/test/recovery/t/060_truncate_empty.pl
new file mode 100644
index 00000000000..b6b7852c5da
--- /dev/null
+++ b/src/test/recovery/t/060_truncate_empty.pl
@@ -0,0 +1,73 @@
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(gettimeofday tv_interval);
+
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init();
+
+$node->append_conf('postgresql.conf', 'shared_buffers = 4GB');
+$node->append_conf('postgresql.conf', 'restart_after_crash = on');
+
+$node->start();
+
+$node->safe_psql('postgres',
+    q[CREATE TABLE test (id int);]);
+
+# SIGSTOP checkpointer
+my $checkpointer_pid = $node->safe_psql('postgres',
+    q[SELECT pid FROM pg_stat_activity WHERE backend_type = 'checkpointer';]);
+chomp($checkpointer_pid);
+kill 'STOP', $checkpointer_pid;
+note("Checkpointer stopped");
+
+$node->pgbench(
+    '--no-vacuum --client=10 --transactions=500',
+    0,
+    [qr{actually processed}],
+    [qr{^$}],
+    'concurrent CREATE and DROP TABLE transactions',
+    {
+        'truncate_empty_script' => q(
+            BEGIN;
+            INSERT INTO test VALUES (:client_id);
+            DELETE FROM test WHERE id = :client_id;
+            CREATE TABLE test_empty_:client_id (id int);
+            DROP TABLE test_empty_:client_id;
+            COMMIT;
+        )
+    });
+
+# cause a system crash by killing WAL writer
+my $walwriter_pid = $node->safe_psql('postgres',
+    q[SELECT pid FROM pg_stat_activity WHERE backend_type = 'walwriter';]);
+chomp($walwriter_pid);
+note("WAL writer PID: $walwriter_pid");
+kill 'KILL', $walwriter_pid;
+
+my $recovery_start = [gettimeofday];
+$node->poll_query_until('postgres', undef, '');
+my $recovery_end = [gettimeofday];
+my $recovery_time = tv_interval($recovery_start, $recovery_end);
+
+note("Crash recovery time: ${recovery_time} seconds");
+
+my $log_content = $node->log_content();
+if ($log_content =~ /redo done at .+? system usage: CPU: user: ([\d.]+) s, system: ([\d.]+) s, elapsed: ([\d.]+) s/m)
+{
+    my $cpu_user = $1;
+    my $cpu_system = $2;
+    my $redo_elapsed = $3;
+    
+    note("Redo elapsed time: $redo_elapsed s");
+    note("  CPU user: $cpu_user s, system: $cpu_system s");
+}
+
+# consistency check
+my $result = $node->safe_psql('postgres', q[SELECT COUNT(*) FROM test;]);
+is($result, '0', 'test table is empty after recovery');
+
+$node->stop();
+done_testing();
-- 
2.50.1 (Apple Git-155)

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Optimize CPU usage of dropping buffers during recovery
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox