Re: Optimize CPU usage of dropping buffers during recovery

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Optimize CPU usage of dropping buffers during recovery
2+ messages / 1 participants
[nested] [flat]

* Re: Optimize CPU usage of dropping buffers during recovery
@ 2026-02-01 08:20  Jingtang Zhang <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: Jingtang Zhang @ 2026-02-01 08:20 UTC (permalink / raw)
  To: [email protected]; pgsql-hackers


A little bit modification for Perl test case and comments in v2 patch.

--
Regards, Jingtang



Attachments:

  [application/octet-stream] v2-0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch (4.1K, 2-v2-0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch)
  download | inline diff:
From 2d423d2b88e0f14a2fb51657ddfd93f7648598ca Mon Sep 17 00:00:00 2001
From: Jingtang Zhang <[email protected]>
Date: Sun, 1 Feb 2026 16:12:48 +0800
Subject: [PATCH] Optimize CPU usage of dropping buffers during recovery

Initialize cached nblocks to 0 when redo CREATE record.
---
 src/backend/access/transam/xlogutils.c    |  7 +++
 src/backend/catalog/storage.c             |  7 +++
 src/test/recovery/t/060_truncate_empty.pl | 69 +++++++++++++++++++++++
 3 files changed, 83 insertions(+)
 create mode 100644 src/test/recovery/t/060_truncate_empty.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5fbe39133b8..387c7621218 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -489,6 +489,13 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 	 */
 	smgrcreate(smgr, forknum, true);
 
+	/*
+	 * Invalidate the cache if the cached value is 0, and let smgrnblocks ask
+	 * the kernel. The relation might be longer than the cached value due to
+	 * relation extension before crash.
+	 */
+	if (smgr->smgr_cached_nblocks[forknum] == 0)
+		smgr->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	lastblock = smgrnblocks(smgr, forknum);
 
 	if (blkno < lastblock)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e443a4993c5..859f515411a 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -993,6 +993,13 @@ smgr_redo(XLogReaderState *record)
 
 		reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
 		smgrcreate(reln, xlrec->forkNum, true);
+
+		if (xlrec->forkNum == MAIN_FORKNUM)
+		{
+			reln->smgr_cached_nblocks[MAIN_FORKNUM] = 0;
+			reln->smgr_cached_nblocks[FSM_FORKNUM] = 0;
+			reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
+		}
 	}
 	else if (info == XLOG_SMGR_TRUNCATE)
 	{
diff --git a/src/test/recovery/t/060_truncate_empty.pl b/src/test/recovery/t/060_truncate_empty.pl
new file mode 100644
index 00000000000..1956733f05f
--- /dev/null
+++ b/src/test/recovery/t/060_truncate_empty.pl
@@ -0,0 +1,69 @@
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(gettimeofday tv_interval);
+
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init();
+
+$node->append_conf('postgresql.conf', 'shared_buffers = 4GB');
+$node->append_conf('postgresql.conf', 'restart_after_crash = on');
+
+$node->start();
+
+$node->safe_psql('postgres',
+    q[CREATE TABLE test (id int);]);
+
+# SIGSTOP checkpointer and run some transactions
+my $checkpointer_pid = $node->safe_psql('postgres',
+    q[SELECT pid FROM pg_stat_activity WHERE backend_type = 'checkpointer';]);
+chomp($checkpointer_pid);
+kill 'STOP', $checkpointer_pid;
+note("Checkpointer stopped");
+
+$node->pgbench(
+    '--no-vacuum --client=10 --transactions=1000',
+    0,
+    [qr{actually processed}],
+    [qr{^$}],
+    'concurrent CREATE and DROP TABLE transactions',
+    {
+        'truncate_empty_script' => q(
+            BEGIN;
+            INSERT INTO test VALUES (:client_id);
+            DELETE FROM test WHERE id = :client_id;
+            CREATE TABLE test_empty_:client_id (id int);
+            DROP TABLE test_empty_:client_id;
+            COMMIT;
+        )
+    });
+
+# stop the node in immediate mode for crash recovery
+$node->stop('immediate');
+
+my $recovery_start = [gettimeofday];
+$node->start();
+my $recovery_end = [gettimeofday];
+my $recovery_time = tv_interval($recovery_start, $recovery_end);
+
+note("Crash recovery time: ${recovery_time} seconds");
+
+my $log_content = $node->log_content();
+if ($log_content =~ /redo done at .+? system usage: CPU: user: ([\d.]+) s, system: ([\d.]+) s, elapsed: ([\d.]+) s/m)
+{
+    my $cpu_user = $1;
+    my $cpu_system = $2;
+    my $redo_elapsed = $3;
+    
+    note("Redo elapsed time: $redo_elapsed s");
+    note("  CPU user: $cpu_user s, system: $cpu_system s");
+}
+
+# consistency check
+my $result = $node->safe_psql('postgres', q[SELECT COUNT(*) FROM test;]);
+is($result, '0', 'test table is empty after recovery');
+
+$node->stop();
+done_testing();
-- 
2.50.1 (Apple Git-155)



^ permalink  raw  reply  [nested|flat] 2+ messages in thread

* Re: Optimize CPU usage of dropping buffers during recovery
@ 2026-06-04 07:19  Jingtang Zhang <[email protected]>
  parent: Jingtang Zhang <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Jingtang Zhang @ 2026-06-04 07:19 UTC (permalink / raw)
  To: [email protected]; pgsql-hackers

Hi hackers,

Reposting this patch to bump the thread -- the v2 I sent a
while ago didn't get any reviews.

To recap the problem: during crash recovery, when a transaction
involves creating and then dropping an empty relation (or a
relation that happens to be empty at the time of the crash),
DropRelationsAllBuffers() falls back to scanning the entire
buffer pool. This is because no WAL record ever references a
block of that relation, so smgr_cached_nblocks is never
initialized - it stays as InvalidBlockNumber, and the optimized
BufMapping lookup path introduced by commit d6ad34f3410 cannot
be used.

This can easily happen in practice. For example, refreshing a
materialized view whose query returns no rows, or truncating a
table and reloading it but the source query returns nothing.
With a large shared_buffers setting, the recovery time becomes
dominated by repeated full scans of the buffer pool.

The attached patch addresses this with a small change in
smgr_redo so that the optimized BufMapping path can be taken
in this scenario. The impact, measured on an earlier run with
16GB shared_buffers and 10 clients * 500 CREATE/DROP
transactions during crash recovery:

  w/o patch: CPU user: 77.58 s, system: 0.27 s
  patched:   CPU user: 0.14 s,  system: 0.09 s

In more detail, the fix has two parts that work together:

1. In smgr_redo(XLOG_SMGR_CREATE), set smgr_cached_nblocks to
   0 for the MAIN, FSM, and VM forks. A CREATE record means the
   relation was just created with zero blocks, so 0 is the
   correct cached value.

2. In XLogReadBufferExtended(), if the cached value is 0,
   invalidate it before calling smgrnblocks(), so that it does a
   fresh lseek to get the real file size. This handles the case
   where the relation was extended before the crash. The extra
   lseek is acceptable here because we are about to do I/O to
   read the block anyway.

For the common case where empty relations are created and
dropped without any data written, part (1) alone is sufficient:
the cached value stays 0, DropRelationsAllBuffers() sees 0
blocks to invalidate, takes the fast path, and returns
immediately. Part (2) is only needed as a safety net for the
less common case.

Regarding the INIT fork question I raised in v1: I now believe
we don't need to handle it. The CREATE record for an INIT fork
has forkNum == INIT_FORKNUM, not MAIN_FORKNUM, so the new
condition simply does not apply to it.

I've also verified that relfilenumber reuse within a single
recovery is not a concern. As documented in mdunlink(), the
first segment file is kept until the next checkpoint to prevent
relfilenumber reuse, and during redo there is no possibility of
allocating new relfilenumbers.

v3 attached, rebased on current master. Compared to v2, the
comments have been improved to better explain the relationship
between the two code changes and why the approach is safe.
Note that the Perl test case included in the patch is
only meant as a benchmark script for conveniently reproducing
the problem and measuring the improvement - it is not intended
as a regression test.

Any thoughts or reviews would be greatly appreciated.

--
Regards, Jingtang

Attachments:

  [application/octet-stream] v3-0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch (5.0K, 2-v3-0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch)
  download | inline diff:
From a9f04391473b3a88482433b881167d662027d968 Mon Sep 17 00:00:00 2001
From: Jingtang Zhang <[email protected]>
Date: Sat, 18 Apr 2026 21:07:13 +0800
Subject: [PATCH v4] Optimize CPU usage of dropping buffers during recovery

Initialize cached nblocks to 0 when redo CREATE record.
---
 src/backend/access/transam/xlogutils.c    | 10 ++++
 src/backend/catalog/storage.c             | 24 ++++++++
 src/test/recovery/t/060_truncate_empty.pl | 69 +++++++++++++++++++++++
 3 files changed, 103 insertions(+)
 create mode 100644 src/test/recovery/t/060_truncate_empty.pl

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5fbe39133b8..090947ba5fe 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -489,6 +489,16 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 	 */
 	smgrcreate(smgr, forknum, true);

+	/*
+	 * If the cached nblocks is 0, it was set by smgr_redo(CREATE) to
+	 * enable the optimized drop-buffer path.  But the relation may
+	 * have been extended before the crash, so we must invalidate the
+	 * cache and let smgrnblocks() do an lseek to get the real size.
+	 * This extra lseek is acceptable here because we're about to do
+	 * I/O to read the block anyway.
+	 */
+	if (smgr->smgr_cached_nblocks[forknum] == 0)
+		smgr->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	lastblock = smgrnblocks(smgr, forknum);

 	if (blkno < lastblock)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e443a4993c5..b93a8b58753 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -993,6 +993,30 @@ smgr_redo(XLogReaderState *record)

 		reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
 		smgrcreate(reln, xlrec->forkNum, true);
+
+		/*
+		 * Initialize the cached nblocks to 0 for a newly created
+		 * relation, so that DropRelationsAllBuffers() can use the
+		 * optimized path (BufMapping lookup) instead of scanning
+		 * the entire buffer pool.
+		 *
+		 * This is safe because a CREATE record means the relation
+		 * has just been created with zero blocks.  If the relation
+		 * was extended before the crash, XLogReadBufferExtended()
+		 * will invalidate this cached value and let smgrnblocks()
+		 * do a fresh lseek.
+		 *
+		 * We only do this for MAIN_FORKNUM CREATE records, which
+		 * correspond to new relation creation.  FSM and VM forks
+		 * are also set to 0 because they cannot exist yet for a
+		 * newly created relation.
+		 */
+		if (xlrec->forkNum == MAIN_FORKNUM)
+		{
+			reln->smgr_cached_nblocks[MAIN_FORKNUM] = 0;
+			reln->smgr_cached_nblocks[FSM_FORKNUM] = 0;
+			reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
+		}
 	}
 	else if (info == XLOG_SMGR_TRUNCATE)
 	{
diff --git a/src/test/recovery/t/060_truncate_empty.pl b/src/test/recovery/t/060_truncate_empty.pl
new file mode 100644
index 00000000000..1956733f05f
--- /dev/null
+++ b/src/test/recovery/t/060_truncate_empty.pl
@@ -0,0 +1,69 @@
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(gettimeofday tv_interval);
+
+my $node = PostgreSQL::Test::Cluster->new('primary');
+$node->init();
+
+$node->append_conf('postgresql.conf', 'shared_buffers = 4GB');
+$node->append_conf('postgresql.conf', 'restart_after_crash = on');
+
+$node->start();
+
+$node->safe_psql('postgres',
+    q[CREATE TABLE test (id int);]);
+
+# SIGSTOP checkpointer and run some transactions
+my $checkpointer_pid = $node->safe_psql('postgres',
+    q[SELECT pid FROM pg_stat_activity WHERE backend_type = 'checkpointer';]);
+chomp($checkpointer_pid);
+kill 'STOP', $checkpointer_pid;
+note("Checkpointer stopped");
+
+$node->pgbench(
+    '--no-vacuum --client=10 --transactions=1000',
+    0,
+    [qr{actually processed}],
+    [qr{^$}],
+    'concurrent CREATE and DROP TABLE transactions',
+    {
+        'truncate_empty_script' => q(
+            BEGIN;
+            INSERT INTO test VALUES (:client_id);
+            DELETE FROM test WHERE id = :client_id;
+            CREATE TABLE test_empty_:client_id (id int);
+            DROP TABLE test_empty_:client_id;
+            COMMIT;
+        )
+    });
+
+# stop the node in immediate mode for crash recovery
+$node->stop('immediate');
+
+my $recovery_start = [gettimeofday];
+$node->start();
+my $recovery_end = [gettimeofday];
+my $recovery_time = tv_interval($recovery_start, $recovery_end);
+
+note("Crash recovery time: ${recovery_time} seconds");
+
+my $log_content = $node->log_content();
+if ($log_content =~ /redo done at .+? system usage: CPU: user: ([\d.]+) s, system: ([\d.]+) s, elapsed: ([\d.]+) s/m)
+{
+    my $cpu_user = $1;
+    my $cpu_system = $2;
+    my $redo_elapsed = $3;
+    
+    note("Redo elapsed time: $redo_elapsed s");
+    note("  CPU user: $cpu_user s, system: $cpu_system s");
+}
+
+# consistency check
+my $result = $node->safe_psql('postgres', q[SELECT COUNT(*) FROM test;]);
+is($result, '0', 'test table is empty after recovery');
+
+$node->stop();
+done_testing();
-- 
2.39.5 (Apple Git-154)

^ permalink  raw  reply  [nested|flat] 2+ messages in thread

end of thread, other threads:[~2026-06-04 07:19 UTC | newest]

Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-01 08:20 Re: Optimize CPU usage of dropping buffers during recovery Jingtang Zhang <[email protected]>
2026-06-04 07:19 ` Jingtang Zhang <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox