Lowering the default wal_blocksize to 4K

From: Andres Freund @ 2023-10-09 23:08 UTC (permalink / raw)
  To: pgsql-hackers; Heikki Linnakangas <[email protected]>; Robert Haas <[email protected]>; Thomas Munro <[email protected]>; Matthias van de Meent <[email protected]>

Hi,

I've mentioned this to a few people before, but forgot to start an actual
thread. So here we go:

I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
the current 8192.  The reason is that

a) We don't gain much from a blocksize above 4096, as we already write all
   the pending WAL data in one go (except when at the tail of
   wal_buffers). We *do* incur more overhead for page headers, but compared to
   the actual WAL data it is not a lot (~0.29% of space is page headers with
   8192 vs 0.59% with 4096).

b) Writing 8KB when we have to flush a partially filled buffer can
   substantially increase write amplification. In a transactional workload,
   this will often double the write volume.
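
To make point b) concrete, here is a minimal model of it, not the actual
flush code. It assumes every record is a commit that forces a flush, that a
flush always writes whole blocks, and it ignores page headers. The record
size (200 bytes) and the helper name bytes_flushed are made up for
illustration.

```python
def bytes_flushed(record_sizes, block_size):
    """Total bytes written when WAL is flushed after every record.

    Each flush writes whole blocks, from the block holding the first
    unflushed byte through the block holding the last inserted byte.
    Page headers are ignored to keep the model small.
    """
    written = 0
    pos = 0  # insert position in the WAL stream, in bytes
    for size in record_sizes:
        start_block = pos // block_size
        pos += size
        end_block = (pos - 1) // block_size
        written += (end_block - start_block + 1) * block_size
    return written


# Hypothetical workload: 1000 small commits of 200 bytes each.
records = [200] * 1000
w8 = bytes_flushed(records, 8192)
w4 = bytes_flushed(records, 4096)
print(f"8192: {w8} bytes, 4096: {w4} bytes, ratio {w8 / w4:.2f}")
```

In this model the partially filled tail block is rewritten on every commit,
so with small records the 8KB configuration writes close to twice as much as
the 4KB one. Once several commits share a flush, the gap narrows.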

Currently disks mostly have 4096 bytes as their "sector size". Sometimes
that's exposed directly, sometimes they can also write in 512 bytes, but that
internally requires a read-modify-write operation.

For some example numbers, I ran a very simple insert workload with a varying
number of clients with both a wal_blocksize=4096 and wal_blocksize=8192
cluster, and measured the amount of bytes written before/after.  The table was
recreated before each run, followed by a checkpoint and the benchmark. Here I
ran the inserts only for 15s each, because the results don't change
meaningfully with longer runs.

With XLOG_BLCKSZ=8192

clients	     tps    disk bytes written
1	     667		 81296
2	     739		 89796
4	    1446		 89208
8	    2858		 90858
16	    5775		 96928
32	   11920		115351
64	   23686		135244
128	   46001		173390
256	   88833		239720
512	  146208		335669

With XLOG_BLCKSZ=4096

clients	     tps    disk bytes written
1	     751		 46838
2	     773		 47936
4	    1512		 48317
8	    3143		 52584
16	    6221		 59097
32	   12863		 73776
64	   25652		 98792
128	   48274		133330
256	   88969		200720
512	  146298		298523

This is on a not-that-fast NVMe SSD (Samsung SSD 970 PRO 1TB).

It's IMO quite interesting that even at the higher client counts, the number
of bytes written doesn't reach parity.
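
To put a number on how far from parity this is, the sketch below computes
the ratio of bytes written (8192 over 4096) per client count. The figures
are copied from the two Samsung SSD 970 PRO tables above; nothing else is
assumed.

```python
# (clients, bytes written with XLOG_BLCKSZ=8192, with XLOG_BLCKSZ=4096),
# copied from the Samsung SSD 970 PRO tables above.
rows = [
    (1, 81296, 46838),
    (2, 89796, 47936),
    (4, 89208, 48317),
    (8, 90858, 52584),
    (16, 96928, 59097),
    (32, 115351, 73776),
    (64, 135244, 98792),
    (128, 173390, 133330),
    (256, 239720, 200720),
    (512, 335669, 298523),
]

# Ratio > 1 means the 8192 cluster wrote more than the 4096 one.
ratios = {clients: w8 / w4 for clients, w8, w4 in rows}

print("clients  8192/4096")
for clients, ratio in ratios.items():
    print(f"{clients:>7}  {ratio:9.2f}")
```

The ratio shrinks as concurrency grows, but it stays above 1 for every
client count in the table.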

On a stripe of two very fast SSDs:

With XLOG_BLCKSZ=8192

clients	     tps    disk bytes written
1	   23786		2893392
2	   38515		4683336
4	   63436		4688052
8	  106618		4618760
16	  177905		4384360
32	  254890		3890664
64	  297113		3031568
128	  299878		2297808
256	  308774		1935064
512	  292515		1630408

With XLOG_BLCKSZ=4096

clients	     tps    disk bytes written
1	   25742		1586748
2	   43578		2686708
4	   62734		2613856
8	  116217		2809560
16	  200802		2947580
32	  269268		2461364
64	  323195		2042196
128	  317160		1550364
256	  309601		1285744
512	  292063		1103816

It's fun to see how the total number of writes *decreases* at higher
concurrency, because it becomes more likely that pages are filled completely.

One thing I noticed is that our auto-configuration of wal_buffers leads to
different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
great.
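
A sketch of why the settings differ, under the assumption that the
auto-tuning works the way I remember it: wal_buffers = -1 picks 1/32 of
shared_buffers, counted in pages, clamped to at least 8 pages and at most
one WAL segment. shared_buffers is counted in BLCKSZ-sized pages, while the
result is a count of XLOG_BLCKSZ-sized pages. The function name and the
128MB / 16MB values are illustrative, and the rule itself should be checked
against the source tree.

```python
BLCKSZ = 8192  # data page size (assumed default)


def auto_wal_buffers_bytes(shared_buffers_bytes, xlog_blcksz,
                           wal_segment_size=16 * 1024 * 1024):
    """Model of the assumed auto-tuning rule; returns wal_buffers in bytes."""
    nbuffers = shared_buffers_bytes // BLCKSZ  # shared_buffers in data pages
    xbuffers = nbuffers // 32                  # 1/32 of that, as a page count
    xbuffers = min(xbuffers, wal_segment_size // xlog_blcksz)
    xbuffers = max(xbuffers, 8)
    return xbuffers * xlog_blcksz              # page count times WAL page size


shared = 128 * 1024 * 1024
for blcksz in (8192, 4096):
    size = auto_wal_buffers_bytes(shared, blcksz)
    print(f"XLOG_BLCKSZ={blcksz}: wal_buffers = {size // 1024} kB")
```

If that reading is right, the page count is the same for both block sizes,
so the amount of WAL buffered in bytes halves with 4096 unless the segment
cap kicks in.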

Performing the same COPY workload (1024 files, split across N clients) for
both settings shows no performance difference, but a very slight increase in
total bytes written (about 0.25%, which is roughly what I'd expect).
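
For reference, the expected increase can be worked out from the page header
size. The sketch below assumes a 24-byte header on every WAL page and
ignores the larger header on the first page of each segment; the 24 is my
assumption, not something measured in this thread.

```python
SHORT_PAGE_HEADER = 24  # assumed size of a regular WAL page header, in bytes


def header_fraction(block_size):
    """Fraction of each WAL page taken up by its header."""
    return SHORT_PAGE_HEADER / block_size


def wal_volume_factor(block_size):
    """WAL bytes on disk per byte of record data, for bulk (full-page) writes."""
    return block_size / (block_size - SHORT_PAGE_HEADER)


for block_size in (8192, 4096):
    print(f"{block_size}: {header_fraction(block_size) * 100:.2f}% header")

# Relative growth in WAL volume when going from 8192 to 4096.
increase = wal_volume_factor(4096) / wal_volume_factor(8192) - 1
print(f"expected WAL volume increase: {increase * 100:.2f}%")
```

Under that assumption the header share matches the ~0.29% and 0.59% figures
quoted earlier, and the expected growth for a bulk load comes out a little
under 0.3%, in the same range as the measured increase.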

Personally I'd say the slight increase in WAL volume is more than outweighed
by the increase in throughput and decrease in bytes written.

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Greetings,

Andres Freund


From: Matthias van de Meent @ 2023-10-10 10:57 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers; Heikki Linnakangas <[email protected]>; Robert Haas <[email protected]>; Thomas Munro <[email protected]>

On Tue, 10 Oct 2023 at 01:08, Andres Freund <[email protected]> wrote:
>
> Hi,
>
> I've mentioned this to a few people before, but forgot to start an actual
> thread. So here we go:
>
> I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
> the current 8192.

Seems like a good idea.

> It's IMO quite interesting that even at the higher client counts, the number
> of bytes written don't reach parity.
>
> It's fun to see how the total number of writes *decreases* at higher
> concurrency, because it becomes more likely that pages are filled completely.

With higher client counts and short transactions I think it is not too
unexpected to see commit_delay+commit_siblings configured. Did you
measure the impact of this change on such configurations?

> One thing I noticed is that our auto-configuration of wal_buffers leads to
> different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
> great.

Hmm.

> Performing the same COPY workload (1024 files, split across N clients) for
> both settings shows no performance difference, but a very slight increase in
> total bytes written (about 0.25%, which is roughly what I'd expect).
>
> Personally I'd say the slight increase in WAL volume is more than outweighed
> by the increase in throughput and decrease in bytes written.

Agreed.

> There's an alternative approach we could take, which is to write in 4KB
> increments, while keeping 8KB pages. With the current format that's not
> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> pages, it seems we should just go for 4KB?

It is not just the disk overhead of blocks, but we also maintain some
other data (currently in the form of XLogRecPtrs) in memory for each
WAL buffer, the overhead of which will also increase when we increase
the number of XLog pages per MB of WAL that we cache.
Additionally, highly concurrent workloads with transactions that write
a high multiple of XLOG_BLCKSZ bytes to WAL may start to see increased
overhead due to the .25% additional WAL getting written and a doubling
of the number of XLog pages being touched (both initialization and the
smaller memcpy for records that would now cross an extra page
boundary).
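
A rough sketch of the two effects, with made-up inputs: one 8-byte
XLogRecPtr per cached WAL page (the exact amount of per-page bookkeeping is
an assumption) and a hypothetical 20000-byte record starting at the
beginning of a page. Page headers are ignored.

```python
XLOG_REC_PTR_BYTES = 8  # XLogRecPtr is a 64-bit value
MIB = 1024 * 1024


def pages_per_mib(block_size):
    """Number of WAL pages needed to cache 1 MiB of WAL."""
    return MIB // block_size


def pages_touched(record_bytes, block_size, start_offset=0):
    """Pages a record spans when it starts at start_offset within a page."""
    last_byte = start_offset + record_bytes - 1
    return last_byte // block_size + 1


for block_size in (8192, 4096):
    pages = pages_per_mib(block_size)
    print(f"{block_size}: {pages} pages/MiB, "
          f"{pages * XLOG_REC_PTR_BYTES} bytes of per-page pointers/MiB, "
          f"{pages_touched(20000, block_size)} pages for a 20000-byte record")
```

Both the per-page bookkeeping and the number of pages a large record spans
roughly double, though the absolute amounts stay small.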

However, for all of these issues I doubt that they actually matter
much in the grand scheme of things, so I definitely wouldn't mind
moving to 4KiB XLog pages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
