Date: Mon, 9 Oct 2023 16:08:05 -0700
From: Andres Freund
To: pgsql-hackers@postgresql.org, Heikki Linnakangas, Robert Haas, Thomas Munro, Matthias van de Meent
Subject: Lowering the default wal_blocksize to 4K
Message-ID: <20231009230805.funj5ipoggjyzjz6@awork3.anarazel.de>

Hi,

I've mentioned this to a few people before, but forgot to start an actual thread.
So here we go:

I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from the current 8192. The reason is that

a) We don't gain much from a blocksize above 4096, as we already write all the pending WAL data in one go (except when at the tail of wal_buffers). We *do* incur more overhead for page headers, but compared to the actual WAL data it is not a lot (~0.29% of space is page headers with 8192 vs 0.59% with 4096).

b) Writing 8KB when we have to flush a partially filled buffer can substantially increase write amplification. In a transactional workload, this will often double the write volume.

Currently disks mostly have 4096 bytes as their "sector size". Sometimes that's exposed directly, sometimes they can also write in 512 bytes, but that internally requires a read-modify-write operation.

For some example numbers, I ran a very simple insert workload with a varying number of clients with both a wal_blocksize=4096 and wal_blocksize=8192 cluster, and measured the amount of bytes written before/after. The table was recreated before each run, followed by a checkpoint and the benchmark. Here I ran the inserts only for 15s each, because the results don't change meaningfully with longer runs.

With XLOG_BLCKSZ=8192

clients     tps     disk bytes written
1           667     81296
2           739     89796
4           1446    89208
8           2858    90858
16          5775    96928
32          11920   115351
64          23686   135244
128         46001   173390
256         88833   239720
512         146208  335669

With XLOG_BLCKSZ=4096

clients     tps     disk bytes written
1           751     46838
2           773     47936
4           1512    48317
8           3143    52584
16          6221    59097
32          12863   73776
64          25652   98792
128         48274   133330
256         88969   200720
512         146298  298523

This is on a not-that-fast NVMe SSD (Samsung SSD 970 PRO 1TB).

It's IMO quite interesting that even at the higher client counts, the number of bytes written doesn't reach parity.
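To make the numbers above easier to compare, here is a small Python sketch that computes the 8192-vs-4096 ratio of bytes written per client count, directly from the figures posted for the Samsung SSD 970 PRO. The function and variable names are made up for illustration. The header-overhead check assumes a 24-byte short page header on every WAL page (an assumption, not something stated above, and it ignores the longer header on the first page of a segment). The flush model simply assumes every flush writes whole WAL pages.

```python
# Bytes written on the Samsung SSD 970 PRO, copied from the numbers above
# (units are whatever the original measurement reported).
clients = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
written_8k = [81296, 89796, 89208, 90858, 96928,
              115351, 135244, 173390, 239720, 335669]
written_4k = [46838, 47936, 48317, 52584, 59097,
              73776, 98792, 133330, 200720, 298523]


def write_ratios(big, small):
    """Return big/small for each pair of measurements."""
    return [b / s for b, s in zip(big, small)]


# ASSUMPTION: 24-byte short page header per WAL page, for illustration only.
ASSUMED_SHORT_HEADER_BYTES = 24


def header_overhead_pct(blcksz, header=ASSUMED_SHORT_HEADER_BYTES):
    """Percentage of each WAL page taken up by the (assumed) page header."""
    return 100.0 * header / blcksz


def flushed_bytes(pending, blcksz):
    """Simplified model: a flush always writes whole pages, so round up."""
    return -(-pending // blcksz) * blcksz


ratios = write_ratios(written_8k, written_4k)

if __name__ == "__main__":
    for c, r in zip(clients, ratios):
        print(f"{c:>4} clients: 8192 writes {r:.2f}x the bytes of 4096")
    for blcksz in (8192, 4096):
        print(f"blcksz={blcksz}: header overhead "
              f"{header_overhead_pct(blcksz):.2f}%, "
              f"300 pending bytes flush as {flushed_bytes(300, blcksz)}")
```

Under these assumptions the ratio stays above 1 at every client count, and a small commit that leaves a mostly empty page costs twice as many written bytes with 8192 as with 4096.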
On a stripe of two very fast SSDs:

With XLOG_BLCKSZ=8192

clients     tps     disk bytes written
1           23786   2893392
2           38515   4683336
4           63436   4688052
8           106618  4618760
16          177905  4384360
32          254890  3890664
64          297113  3031568
128         299878  2297808
256         308774  1935064
512         292515  1630408

With XLOG_BLCKSZ=4096

clients     tps     disk bytes written
1           25742   1586748
2           43578   2686708
4           62734   2613856
8           116217  2809560
16          200802  2947580
32          269268  2461364
64          323195  2042196
128         317160  1550364
256         309601  1285744
512         292063  1103816

It's fun to see how the total number of writes *decreases* at higher concurrency, because it becomes more likely that pages are filled completely.

One thing I noticed is that our auto-configuration of wal_buffers leads to different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem great.

Performing the same COPY workload (1024 files, split across N clients) for both settings shows no performance difference, but a very slight increase in total bytes written (about 0.25%, which is roughly what I'd expect).

Personally I'd say the slight increase in WAL volume is more than outweighed by the increase in throughput and decrease in bytes written.

There's an alternative approach we could take, which is to write in 4KB increments, while keeping 8KB pages. With the current format that's not obviously a bad idea. But given there aren't really advantages in 8KB WAL pages, it seems we should just go for 4KB?

Greetings,

Andres Freund