Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vkn4s-002ARD-0V for pgsql-hackers@arkaria.postgresql.org; Tue, 27 Jan 2026 17:42:10 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vkn4r-00FJzj-0q for pgsql-hackers@arkaria.postgresql.org; Tue, 27 Jan 2026 17:42:09 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vkn4q-00FJzb-2i for pgsql-hackers@lists.postgresql.org; Tue, 27 Jan 2026 17:42:09 +0000 Received: from mail-pf1-x435.google.com ([2607:f8b0:4864:20::435]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vkn4o-002bbh-1s for pgsql-hackers@postgresql.org; Tue, 27 Jan 2026 17:42:08 +0000 Received: by mail-pf1-x435.google.com with SMTP id d2e1a72fcca58-8230c33f477so2643406b3a.2 for ; Tue, 27 Jan 2026 09:42:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1769535726; x=1770140526; darn=postgresql.org; h=to:date:message-id:subject:mime-version:from:from:to:cc:subject :date:message-id:reply-to; bh=HUqG/nxIaCwcOV1VjFemhwPw/gWygpHm6/v6ZTbgQNk=; b=Kqb+KX7YU8bz4tk2/J9Aq8Ph0OpnkJvx0soZ4C5Ibcb6/98j7cT2GGUJLe1oKhUJVp F8qWsTpkznEmMLxD0PKy4iSoWIOzuuBEEpI8golmNwaja+XlvBZ3hrqqEQl3680bKOW2 5RVa/zDZ8rG+u5txkZ0/NPOM6aSDMecEFEil9pd5kahJVfD60gShdGf3OqEaf0vzjD8K prDIRJIIXS1CQ3IBcZ5IBT0cuQB7TYdy5SvKJKSo9f2abMfOykh013e+PRVwmpR/rCeE Vmg9Yi+nmkKbfUB1TCRVEF/bgDsPNZXhFGfWdytPKi8eOjrE+sN1xqzyzTsvDAwtPfxw kwJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1769535726; x=1770140526; h=to:date:message-id:subject:mime-version:from:x-gm-gg :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=HUqG/nxIaCwcOV1VjFemhwPw/gWygpHm6/v6ZTbgQNk=; b=nHoE2cATYqskeysnGkJ/A4okHSkFmErcwoNB/ikJvWH2mNAk1yY1eLKgwCF4tTTmU4 tA1LWVwC0r15GmOcE5CpsknR9XOdDZJ/lryqxFbjwFW3HSORiMKzNgKOGQPeBQIfE1Ob eIhPRuNAvFfDYrvgUdBoPnkwcF7LoXPbTZwjEEjaIrkLZk+cFIq3QN9sEc+FMw6wyaFJ saPuhL+RgvUBDuLJggKWyhDGqmrbUZXj4oPsGq9w0L0vFItGN+8UF3QMD8UR1KJtPGMT 1+CqSQ6Y6XOf/kU2Bhtis7+R0nPW2uMxM/wO3mY9CheeGnpXPLzHuyK4VHXwKEYxDO4g ejjw== X-Forwarded-Encrypted: i=1; AJvYcCXQ19gD+FqrdBml0TP8HxkkeNpkVJboavZmU4XRsAXIiUjEGmMbb2yEGIkz9IRiuIlzQDd8XQcsx2cMUqxQ@postgresql.org X-Gm-Message-State: AOJu0YxWqRIsyzxOHZfWWQAqLX2eOgNmHhPW3j/CUcEBxgFUz0K1Gomr /I4WaAqqLTNXvaP/r5+lssgDDMqQERn6pOJZhBUZqxLGOu0neySi9xPN X-Gm-Gg: AZuq6aJ/CLgHtYzgCcPNnJD1goS9PEyBVsnORZx1pvwCv/i6fMx5dugoCE/CPYJw4yl fdTA+5cpZF6du5oja/K0TGe8GFwB6L8q6w23ZJgqgg+yUlx2TpNW88VYuj+z1WGb0vgzKCSCzNz KAf4sizeamgK1pJocRKZj1H5a/B1MJvYoZhKD4Mk3MF8965QHkmw1bThRUJBaMWOumlDk1YahCH KMeN9BWy+oJys45/3vFUFg+/SnLecqpKtQTB3+saPK0uqSav7D/OLrAGnMGuCqePUf0eyPFN26j RMT+DYXrmOy/vAV8vjDLeUKf3PQDI+ujnjlyygqY3r2BtXHwKbzWgr22+MXpWg9v6M8TUK8Y0aY Ojdnewly2QEADOJ2631afq2u9Aj47ywFxYJYYXyutUSg8r5fb/demEj454sDIOLq7YAqRK/Pv7f Esl/hCtTym+KFBvxvW4u2nchDSw5HlqNVytdQqhgv4FWSq9zB/sKUY X-Received: by 2002:aa7:88c8:0:b0:81f:477d:58db with SMTP id d2e1a72fcca58-823692919a4mr2304455b3a.39.1769535726336; Tue, 27 Jan 2026 09:42:06 -0800 (PST) Received: from smtpclient.apple (n058152109156.netvigator.com. [58.152.109.156]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-c642487f1f5sm74165a12.0.2026.01.27.09.42.04 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 27 Jan 2026 09:42:05 -0800 (PST) From: Jingtang Zhang Content-Type: multipart/mixed; boundary="Apple-Mail=_8E9888A9-26D9-4147-B91B-908FCEDD9ABF" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3826.700.81.1.4\)) Subject: Optimize CPU usage of dropping buffers during recovery Message-Id: <77956427-F0BC-4B2C-9114-81859CDA8E06@gmail.com> Date: Wed, 28 Jan 2026 01:41:52 +0800 To: pgsql-hackers@lists.postgresql.org, PostgreSQL-development X-Mailer: Apple Mail (2.3826.700.81.1.4) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_8E9888A9-26D9-4147-B91B-908FCEDD9ABF Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii Hi hackers, I happen to find a case where a transaction involves dropping a relation file, and a relation happens to be empty, the WAL recovery will waste a lot of time on scanning the whole buffer pool. In DropRelationBuffers/DropRelationsAllBuffers, we will take fast path to avoid scanning whole buffer pool, if the buffers to be invalidated are less than typically 1/32 of buffer pool. However, since no more WAL records are referencing an empty relation, if we want to drop it, = startup process will never initialize smgr_cached_nblocks for that relation, causing fallback to slow path. The situation can be like refreshing a materialized view whose query returns no row, or truncating a table and then loading it but the query returns no row. It will generate WAL sequence like: rmgr: Standby len (rec/tot): 42/ 42, tx: 780, lsn: = 0/41238520, prev 0/41238490, desc: LOCK xid 780 db 16384 rel 16416 rmgr: Storage len (rec/tot): 42/ 42, tx: 780, lsn: = 0/412385F0, prev 0/412385C0, desc: CREATE base/16384/16416 ... rmgr: Transaction len (rec/tot): 1073/ 1073, tx: 783, lsn: = 0/4136BF70, prev 0/41369F60, desc: COMMIT 2026-01-27 08:07:49.953301 = UTC; rels: base/16384/16416 This makes me think about initializing the cached nblocks to zero when redo CREATE record. Note that the 0-size may not correct, because the relation may have been zeroextended before server crash, and recovery starts from an earlier checkpoint. But we can correct that by = invalidating the cached value when we really meet a WAL record referencing a block of that relation, so that a later smgrnblocks() will fill-in the correct file length. It can avoid extending the relation in the middle of a file by mistake. Quickly proposed a patch for it, with a Perl program for judging the performance difference. In my environment with 16GB shared buffers, -O2 optimized, debugging = off, I've run a test of 10 client * 500 transactions using such workload, = made the server crash and poll it until recovery is done. The CPU time: w/o patch: CPU user: 77.58 s, system: 0.27 s patched: CPU user: 0.14 s, system: 0.09 s Without the patch, the CPU is wasted on BufTagMatchesRelFileLocator: Overhead Command Shared Object Symbol - 98.23% postgres postgres [.] DropRelationsAllBuffers 98.21% DropRelationsAllBuffers smgrdounlinkall DropRelationFiles xact_redo_commit xact_redo PerformWalRecovery StartupXLOG StartupProcessMain postmaster_child_launch StartChildProcess PostmasterStateMachine ServerLoop.isra.0 0.06% postgres postgres [.] = hash_search_with_hash_value Not sure if there will be any problem of correctness. Also, not sure = about what to do about INIT fork. Any thoughts? -- Regards, Jingtang --Apple-Mail=_8E9888A9-26D9-4147-B91B-908FCEDD9ABF Content-Disposition: attachment; filename=0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch Content-Type: application/octet-stream; x-unix-mode=0644; name="0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch" Content-Transfer-Encoding: quoted-printable =46rom=202d8caed717a4ee68e045cb334376859049518789=20Mon=20Sep=2017=20= 00:00:00=202001=0AFrom:=20Jingtang=20Zhang=20=0A= Date:=20Wed,=2028=20Jan=202026=2001:36:21=20+0800=0ASubject:=20[PATCH]=20= Optimize=20CPU=20usage=20of=20dropping=20buffers=20during=20recovery=0A=0A= Initialize=20cached=20nblocks=20to=200=20when=20redo=20CREATE=20record.=0A= ---=0A=20src/backend/access/transam/xlogutils.c=20=20=20=20|=20=202=20+=0A= =20src/backend/catalog/storage.c=20=20=20=20=20=20=20=20=20=20=20=20=20|=20= =208=20+++=0A=20src/test/recovery/t/060_truncate_empty.pl=20|=2073=20= +++++++++++++++++++++++=0A=203=20files=20changed,=2083=20insertions(+)=0A= =20create=20mode=20100644=20src/test/recovery/t/060_truncate_empty.pl=0A=0A= diff=20--git=20a/src/backend/access/transam/xlogutils.c=20= b/src/backend/access/transam/xlogutils.c=0Aindex=20= 5fbe39133b8..90627683935=20100644=0A---=20= a/src/backend/access/transam/xlogutils.c=0A+++=20= b/src/backend/access/transam/xlogutils.c=0A@@=20-489,6=20+489,8=20@@=20= XLogReadBufferExtended(RelFileLocator=20rlocator,=20ForkNumber=20= forknum,=0A=20=09=20*/=0A=20=09smgrcreate(smgr,=20forknum,=20true);=0A=20= =0A+=09if=20(smgr->smgr_cached_nblocks[forknum]=20=3D=3D=200)=0A+=09=09= smgr->smgr_cached_nblocks[forknum]=20=3D=20InvalidBlockNumber;=0A=20=09= lastblock=20=3D=20smgrnblocks(smgr,=20forknum);=0A=20=0A=20=09if=20= (blkno=20<=20lastblock)=0Adiff=20--git=20a/src/backend/catalog/storage.c=20= b/src/backend/catalog/storage.c=0Aindex=20e443a4993c5..6e35b2bfbdb=20= 100644=0A---=20a/src/backend/catalog/storage.c=0A+++=20= b/src/backend/catalog/storage.c=0A@@=20-993,6=20+993,14=20@@=20= smgr_redo(XLogReaderState=20*record)=0A=20=0A=20=09=09reln=20=3D=20= smgropen(xlrec->rlocator,=20INVALID_PROC_NUMBER);=0A=20=09=09= smgrcreate(reln,=20xlrec->forkNum,=20true);=0A+=0A+=09=09if=20= (xlrec->forkNum=20=3D=3D=20MAIN_FORKNUM)=0A+=09=09{=0A+=09=09=09= reln->smgr_cached_nblocks[MAIN_FORKNUM]=20=3D=200;=0A+=09=09=09= reln->smgr_cached_nblocks[FSM_FORKNUM]=20=3D=200;=0A+=09=09=09= reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM]=20=3D=200;=0A+=09=09=09= //=20reln->smgr_cached_nblocks[INIT_FORKNUM]=20=3D=200;=0A+=09=09}=0A=20=09= }=0A=20=09else=20if=20(info=20=3D=3D=20XLOG_SMGR_TRUNCATE)=0A=20=09{=0A= diff=20--git=20a/src/test/recovery/t/060_truncate_empty.pl=20= b/src/test/recovery/t/060_truncate_empty.pl=0Anew=20file=20mode=20100644=0A= index=2000000000000..b6b7852c5da=0A---=20/dev/null=0A+++=20= b/src/test/recovery/t/060_truncate_empty.pl=0A@@=20-0,0=20+1,73=20@@=0A= +use=20strict;=0A+use=20warnings=20FATAL=20=3D>=20'all';=0A+use=20= PostgreSQL::Test::Cluster;=0A+use=20PostgreSQL::Test::Utils;=0A+use=20= Test::More;=0A+use=20Time::HiRes=20qw(gettimeofday=20tv_interval);=0A+=0A= +my=20$node=20=3D=20PostgreSQL::Test::Cluster->new('primary');=0A= +$node->init();=0A+=0A+$node->append_conf('postgresql.conf',=20= 'shared_buffers=20=3D=204GB');=0A+$node->append_conf('postgresql.conf',=20= 'restart_after_crash=20=3D=20on');=0A+=0A+$node->start();=0A+=0A= +$node->safe_psql('postgres',=0A+=20=20=20=20q[CREATE=20TABLE=20test=20= (id=20int);]);=0A+=0A+#=20SIGSTOP=20checkpointer=0A+my=20= $checkpointer_pid=20=3D=20$node->safe_psql('postgres',=0A+=20=20=20=20= q[SELECT=20pid=20FROM=20pg_stat_activity=20WHERE=20backend_type=20=3D=20= 'checkpointer';]);=0A+chomp($checkpointer_pid);=0A+kill=20'STOP',=20= $checkpointer_pid;=0A+note("Checkpointer=20stopped");=0A+=0A= +$node->pgbench(=0A+=20=20=20=20'--no-vacuum=20--client=3D10=20= --transactions=3D500',=0A+=20=20=20=200,=0A+=20=20=20=20[qr{actually=20= processed}],=0A+=20=20=20=20[qr{^$}],=0A+=20=20=20=20'concurrent=20= CREATE=20and=20DROP=20TABLE=20transactions',=0A+=20=20=20=20{=0A+=20=20=20= =20=20=20=20=20'truncate_empty_script'=20=3D>=20q(=0A+=20=20=20=20=20=20=20= =20=20=20=20=20BEGIN;=0A+=20=20=20=20=20=20=20=20=20=20=20=20INSERT=20= INTO=20test=20VALUES=20(:client_id);=0A+=20=20=20=20=20=20=20=20=20=20=20= =20DELETE=20FROM=20test=20WHERE=20id=20=3D=20:client_id;=0A+=20=20=20=20=20= =20=20=20=20=20=20=20CREATE=20TABLE=20test_empty_:client_id=20(id=20= int);=0A+=20=20=20=20=20=20=20=20=20=20=20=20DROP=20TABLE=20= test_empty_:client_id;=0A+=20=20=20=20=20=20=20=20=20=20=20=20COMMIT;=0A= +=20=20=20=20=20=20=20=20)=0A+=20=20=20=20});=0A+=0A+#=20cause=20a=20= system=20crash=20by=20killing=20WAL=20writer=0A+my=20$walwriter_pid=20=3D=20= $node->safe_psql('postgres',=0A+=20=20=20=20q[SELECT=20pid=20FROM=20= pg_stat_activity=20WHERE=20backend_type=20=3D=20'walwriter';]);=0A= +chomp($walwriter_pid);=0A+note("WAL=20writer=20PID:=20$walwriter_pid");=0A= +kill=20'KILL',=20$walwriter_pid;=0A+=0A+my=20$recovery_start=20=3D=20= [gettimeofday];=0A+$node->poll_query_until('postgres',=20undef,=20'');=0A= +my=20$recovery_end=20=3D=20[gettimeofday];=0A+my=20$recovery_time=20=3D=20= tv_interval($recovery_start,=20$recovery_end);=0A+=0A+note("Crash=20= recovery=20time:=20${recovery_time}=20seconds");=0A+=0A+my=20= $log_content=20=3D=20$node->log_content();=0A+if=20($log_content=20=3D~=20= /redo=20done=20at=20.+?=20system=20usage:=20CPU:=20user:=20([\d.]+)=20s,=20= system:=20([\d.]+)=20s,=20elapsed:=20([\d.]+)=20s/m)=0A+{=0A+=20=20=20=20= my=20$cpu_user=20=3D=20$1;=0A+=20=20=20=20my=20$cpu_system=20=3D=20$2;=0A= +=20=20=20=20my=20$redo_elapsed=20=3D=20$3;=0A+=20=20=20=20=0A+=20=20=20=20= note("Redo=20elapsed=20time:=20$redo_elapsed=20s");=0A+=20=20=20=20= note("=20=20CPU=20user:=20$cpu_user=20s,=20system:=20$cpu_system=20s");=0A= +}=0A+=0A+#=20consistency=20check=0A+my=20$result=20=3D=20= $node->safe_psql('postgres',=20q[SELECT=20COUNT(*)=20FROM=20test;]);=0A= +is($result,=20'0',=20'test=20table=20is=20empty=20after=20recovery');=0A= +=0A+$node->stop();=0A+done_testing();=0A--=20=0A2.50.1=20(Apple=20= Git-155)=0A=0A= --Apple-Mail=_8E9888A9-26D9-4147-B91B-908FCEDD9ABF--