Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

public inbox for [email protected]  

13+ messages / 4 participants

* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-18 10:57  Pierre Barre <[email protected]>
  0 siblings, 3 replies; 13+ messages in thread

From: Pierre Barre @ 2025-07-18 10:57 UTC (permalink / raw)
  To: Seref Arikan <[email protected]>; +Cc: [email protected]

Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following setup:

- A Hetzner S3 bucket in the FSN1 region
- A ccx63 virtual machine (48 vCPUs, 192 GB memory)
- 3 ZeroFS NBD devices (same S3 bucket)
- A striped ZFS pool across the 3 devices
- A 200 GB ZFS L2ARC
- Postgres with memory settings tuned to match, plus synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
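
A quick way to sanity-check pgbench output from a setup like this: with a fixed number of clients and no think time, the average latency should be roughly clients / tps (Little's law). The sketch below applies that to the figures from the original post quoted further down (50 clients; 53041.006947 and 413436.248089 tps). The function name is made up for illustration and is not part of pgbench.

```python
def expected_latency_ms(clients: int, tps: float) -> float:
    """Average per-transaction latency implied by Little's law, in milliseconds."""
    return clients / tps * 1000.0


# Figures copied from the two pgbench runs in the original post.
read_write = expected_latency_ms(50, 53041.006947)
read_only = expected_latency_ms(50, 413436.248089)

print(f"read/write: {read_write:.3f} ms")
print(f"read-only:  {read_only:.3f} ms")
```

Both values round to the "latency average" lines pgbench printed (0.943 ms and 0.121 ms), so the reported throughput and latency are at least internally consistent.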

Best,
Pierre

On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
> Sorry, this was meant to go to the whole group:
> 
> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
> 
> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <[email protected]> wrote:
>> Hi everyone,
>> 
>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>> 
>> ZeroFS: https://github.com/Barre/ZeroFS
>> 
>> # The Architecture
>> 
>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>> 
>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>> 
>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
>> 
>> ## Performance Results
>> 
>> Here are pgbench results from PostgreSQL running on this setup:
>> 
>> ### Read/Write Workload
>> 
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: <builtin: TPC-B (sort of)>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 100000
>> number of transactions actually processed: 5000000/5000000
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.943 ms
>> initial connection time = 48.043 ms
>> tps = 53041.006947 (without initial connection time)
>> ```
>> 
>> ### Read-Only Workload
>> 
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: <builtin: select only>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 100000
>> number of transactions actually processed: 5000000/5000000
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.121 ms
>> initial connection time = 53.358 ms
>> tps = 413436.248089 (without initial connection time)
>> ```
>> 
>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>> 
>> ## How It Works
>> 
>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>> 2. Multiple cache layers hide S3 latency:
>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>    b. ZeroFS memory cache for metadata and hot data
>>    c. Optional local disk cache
>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>> 
>> ## Geo-Distributed PostgreSQL
>> 
>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>> 
>> Example architectures:
>> 
>> Architecture 1:
>> 
>> 
>>                          PostgreSQL Client
>>                                    |
>>                                    | SQL queries
>>                                    |
>>                             +--------------+
>>                             |  PG Proxy    |
>>                             | (HAProxy/    |
>>                             |  PgBouncer)  |
>>                             +--------------+
>>                                /        \
>>                               /          \
>>                    Synchronous            Synchronous
>>                    Replication            Replication
>>                             /              \
>>                            /                \
>>               +---------------+        +---------------+
>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>               | (Primary)     |◄------►| (Standby)     |
>>               +---------------+        +---------------+
>>                       |                        |
>>                       |  POSIX filesystem ops  |
>>                       |                        |
>>               +---------------+        +---------------+
>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>               | (3-way mirror)|        | (3-way mirror)|
>>               +---------------+        +---------------+
>>                /      |      \          /      |      \
>>               /       |       \        /       |       \
>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>              |        |        |           |        |        |
>>         +--------++--------++--------++--------++--------++--------+
>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>         +--------++--------++--------++--------++--------++--------+
>>              |         |         |         |         |         |
>>              |         |         |         |         |         |
>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>> 
>> Architecture 2:
>> 
>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>                 \                    /
>>                  \                  /
>>                   Same ZFS Pool (NBD)
>>                          |
>>                   6 Global ZeroFS
>>                          |
>>                       S3 Regions
>> 
>> 
>> The main advantages I see are:
>> 1. Dramatic cost reduction for large datasets
>> 2. Simplified geo-distribution 
>> 3. Infinite storage capacity
>> 4. Built-in encryption and compression
>> 
>> Looking forward to your feedback and questions!
>> 
>> Best,
>> Pierre
>> 
>> P.S. The full project includes a custom NFS filesystem too.
>> 

* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-18 12:31  Pierre Barre <[email protected]>
  parent: Pierre Barre <[email protected]>
  2 siblings, 0 replies; 13+ messages in thread

From: Pierre Barre @ 2025-07-18 12:31 UTC (permalink / raw)
  To: ; +Cc: [email protected]

Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs: during a partition, you choose between consistency and availability.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary              PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances

Best,
Pierre


* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-18 12:55  Seref Arikan <[email protected]>
  parent: Pierre Barre <[email protected]>
  2 siblings, 1 reply; 13+ messages in thread

From: Seref Arikan @ 2025-07-18 12:55 UTC (permalink / raw)
  To: Pierre Barre <[email protected]>; +Cc: [email protected]

Thanks, I learned something else: I didn't know Hetzner offered
S3-compatible storage.

The interesting thing is, a few searches about the performance return
mostly negative impressions about their object storage in comparison to the
original S3.

Finding out what kind of performance your benchmarks would yield on a pure
AWS setting would be interesting. I am not asking you to do that, but you
may get even better performance in that case :)

Cheers,
Seref



* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-18 13:11  Pierre Barre <[email protected]>
  parent: Seref Arikan <[email protected]>
  0 siblings, 0 replies; 13+ messages in thread

From: Pierre Barre @ 2025-07-18 13:11 UTC (permalink / raw)
  To: Seref Arikan <[email protected]>; +Cc: [email protected]

> The interesting thing is, a few searches about the performance return mostly negative impressions about their object storage in comparison to the original S3.

I think they had a rough start, but it's quite good now from what I've experienced. It's also dirt cheap, and they don't bill for operations, so if you run ZeroFS on it you only pay for raw storage, at €4.99 a month.

Combine that with their dirt-cheap dedicated servers (https://www.hetzner.com/dedicated-rootserver/matrix-ax/) and you can have a multi-terabyte Postgres database for under €50 a month.

I'm dreaming of running https://www.merklemap.com/ on such a setup, but it's too early yet :)

> Finding out what kind of performance your benchmarks would yield on a pure AWS setting would be interesting. I am not asking you to do that, but you may get even better performance in that case :) 

Yes, I need to try that!

Best,
Pierre


* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-24 19:44  Nico Williams <[email protected]>
  parent: Pierre Barre <[email protected]>
  2 siblings, 1 reply; 13+ messages in thread

From: Nico Williams @ 2025-07-24 19:44 UTC (permalink / raw)
  To: Pierre Barre <[email protected]>; +Cc: Seref Arikan <[email protected]>; [email protected]

On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
> - Postgres configured accordingly memory-wise as well as with
>   synchronous_commit = off, wal_init_zero = off and wal_recycle = off.

Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
it's not safe _unless_ you have a local, fast, persistent ZIL device
(which I assume you don't).

Nico

* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-24 19:50  Pierre Barre <[email protected]>
  parent: Nico Williams <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Pierre Barre @ 2025-07-24 19:50 UTC (permalink / raw)
  To: Nico Williams <[email protected]>; +Cc: Seref Arikan <[email protected]>; [email protected]

It’s not “safe” or “unsafe”; there are mountains of valid workloads that don’t require synchronous_commit. Setting synchronous_commit = on doesn’t make your system automatically safe either, and if that’s a requirement, there are many workarounds, as you suggested. It certainly doesn’t make the setup useless.
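
For context on the trade-off being debated: per the PostgreSQL documentation, with synchronous_commit = off a commit is acknowledged before its WAL record is flushed, and the window of acknowledged-but-unflushed transactions is at most three times wal_writer_delay (200 ms by default). A crash can lose those transactions but does not leave the database inconsistent. The sketch below estimates that exposure. The 10,000 tps input is a hypothetical figure, the helper names are made up for illustration, and it assumes the default wal_writer_delay. It covers only the PostgreSQL side, not what the layers underneath do with a flush.

```python
# Rough estimate of how many acknowledged commits could be lost on a crash
# when synchronous_commit = off. Assumption: wal_writer_delay is at its
# default of 200 ms (the value used in the benchmark is not stated here).

def max_loss_window_ms(wal_writer_delay_ms: float = 200.0) -> float:
    """Upper bound on the at-risk window: three times wal_writer_delay."""
    return 3 * wal_writer_delay_ms


def commits_at_risk(tps: float, wal_writer_delay_ms: float = 200.0) -> float:
    """Acknowledged commits that may fall inside the at-risk window."""
    return tps * max_loss_window_ms(wal_writer_delay_ms) / 1000.0


if __name__ == "__main__":
    hypothetical_tps = 10_000.0  # illustrative input, not a measured value
    print(f"window: {max_loss_window_ms():.0f} ms")
    print(f"commits at risk: {commits_at_risk(hypothetical_tps):.0f}")
```

Whether that exposure is acceptable depends on the workload, which is the point both sides of this exchange are making.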

Best,
Pierre


* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-24 22:03  Jeff Ross <[email protected]>
  parent: Pierre Barre <[email protected]>
  0 siblings, 2 replies; 13+ messages in thread

From: Jeff Ross @ 2025-07-24 22:03 UTC (permalink / raw)
  To: [email protected]

On 7/24/25 13:50, Pierre Barre wrote:

> It’s not “safe” or “unsafe”; there are mountains of valid workloads that don’t require synchronous_commit. Setting synchronous_commit = on doesn’t make your system automatically safe either, and if that’s a requirement, there are many workarounds, as you suggested. It certainly doesn’t make the setup useless.
>
> Best,
> Pierre
>
> On Thu, Jul 24, 2025, at 21:44, Nico Williams wrote:
>> On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
>>> - Postgres configured accordingly memory-wise as well as with
>>>    synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>> Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
>> it's not safe _unless_ you have a local, fast, persistent ZIL device
>> (which I assume you don't).
>>
>> Nico
>> --
This then begs the obvious question of how fast is this with 
synchronous_commit = on?







* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-24 22:44  Pierre Barre <[email protected]>
  parent: Jeff Ross <[email protected]>
  1 sibling, 1 reply; 13+ messages in thread

From: Pierre Barre @ 2025-07-24 22:44 UTC (permalink / raw)
  To: Jeff Ross <[email protected]>; [email protected]

> This then begs the obvious question of how fast is this with 
> synchronous_commit = on?

Probably not awful, especially with commit_delay.

I'll try that and report back.

Best,
Pierre


* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-25 09:25  Pierre Barre <[email protected]>
  parent: Pierre Barre <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Pierre Barre @ 2025-07-25 09:25 UTC (permalink / raw)
  To: Jeff Ross <[email protected]>; [email protected]

Hi,

I went ahead and did that test.

Here is the PostgreSQL config I used, for reference. Note the WAL options (wal_recycle, wal_init_zero) as well as full_page_writes = off, which is set because ZeroFS cannot have torn writes by design.

https://gist.github.com/Barre/8d68f0d00446389998a31f4e60f3276d

The test ran on Azure, on a Standard D16ads v5 instance (16 vCPUs, 64 GiB memory).

This time I didn't run ZFS with L2ARC; I just mounted ZeroFS over 9P.

synchronous_commit = off 

postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 6.239 ms
initial connection time = 68.922 ms
tps = 16026.940646 (without initial connection time)

synchronous_commit = on

postgres@zerofs:~$ pgbench -vvv -c 50 -j 15 -t 1000 bench
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 50000/50000
number of failed transactions: 0 (0.000%)
latency average = 197.723 ms
initial connection time = 46.089 ms
tps = 252.878721 (without initial connection time)

Not great barebones with synchronous_commit = on, but still usable!
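
A quick way to read the two runs above: in a closed-loop benchmark like pgbench, average latency is roughly the number of clients divided by throughput (Little's law). The sketch below checks that the reported figures appear consistent with that relation. The function name is hypothetical, and the numbers are copied from the output above. Note that the two runs used different client counts (100 vs. 50), so their tps values are not directly comparable.

```python
# Little's law for a closed-loop benchmark: latency ~= clients / tps.

def implied_latency_ms(clients: int, tps: float) -> float:
    """Average per-transaction latency implied by client count and tps."""
    return clients / tps * 1000.0


# (clients, reported tps, reported average latency in ms)
runs = {
    "synchronous_commit = off": (100, 16026.940646, 6.239),
    "synchronous_commit = on": (50, 252.878721, 197.723),
}

if __name__ == "__main__":
    for name, (clients, tps, reported_ms) in runs.items():
        implied = implied_latency_ms(clients, tps)
        print(f"{name}: implied {implied:.3f} ms, reported {reported_ms} ms")
```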

Best,
Pierre


* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-26 01:16  Pierre Barre <[email protected]>
  parent: Pierre Barre <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Pierre Barre @ 2025-07-26 01:16 UTC (permalink / raw)
  To: Jeff Ross <[email protected]>; [email protected]

I built Postgres (same version, 16.9) with --with-blocksize=32 (I'd really love it if this were an initdb-time flag!) and did some more testing:

synchronous_commit = off

postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 10000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 5.727 ms
initial connection time = 59.223 ms
tps = 17460.128835 (without initial connection time)

synchronous_commit = on 

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 301.800 ms
initial connection time = 62.237 ms
tps = 331.345391 (without initial connection time)

=====================================

Then, using the same setup (same server, same Postgres build), I created a ZeroFS NBD device with ext4 on top:

/dev/nbd0 on /mnt_9p type ext4 (rw,relatime,stripe=32)

synchronous_commit = off

postgres@zerofs:/mnt_9p$ pgbench -vvv -c 100 -j 40 -t 10000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 3.615 ms
initial connection time = 45.653 ms
tps = 27665.373366 (without initial connection time)

synchronous_commit = on

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 337.762 ms
initial connection time = 43.969 ms
tps = 296.066616 (without initial connection time)
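
To put the four runs in this message side by side, the sketch below computes how much throughput drops when synchronous_commit is turned on for each backend. It assumes the first pair of runs used the 9P mount, as in the previous message. The labels and function name are illustrative only. The synchronous_commit = off runs used -t 10000 while the synchronous_commit = on runs used -t 1000, so treat the ratios as rough.

```python
# Throughput ratio between synchronous_commit = off and = on, per backend.
# tps values are copied from the pgbench output in this message.

def slowdown(tps_off: float, tps_on: float) -> float:
    """How many times slower the synchronous_commit = on run was."""
    return tps_off / tps_on


results = {
    "9P (assumed)": {"off": 17460.128835, "on": 331.345391},
    "NBD + ext4": {"off": 27665.373366, "on": 296.066616},
}

if __name__ == "__main__":
    for backend, tps in results.items():
        ratio = slowdown(tps["off"], tps["on"])
        print(f"{backend}: off={tps['off']:.0f} tps, "
              f"on={tps['on']:.0f} tps, ratio={ratio:.1f}x")
```

On these numbers, ext4 on NBD is clearly ahead with synchronous_commit = off, while the two are close with it on, which suggests the sync path rather than the filesystem layer dominates in that case.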

Best,
Pierre


* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2025-07-26 01:22  Pierre Barre <[email protected]>
  parent: Pierre Barre <[email protected]>
  0 siblings, 0 replies; 13+ messages in thread

From: Pierre Barre @ 2025-07-26 01:22 UTC (permalink / raw)
  To: Jeff Ross <[email protected]>; [email protected]

And finally, some read-only benchmarks with the same Postgres build.

9P:

postgres@zerofs:/mnt_9p$ pgbench -vvv -c 100 -j 40 -t 10000 bench -S
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 0.539 ms
initial connection time = 59.157 ms
tps = 185652.686153 (without initial connection time)


ext4:

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 10000 bench -S
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 0.547 ms
initial connection time = 44.054 ms
tps = 182836.180428 (without initial connection time)
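
The two read-only results are very close. The sketch below quantifies the gap as a percentage; the function name is hypothetical. One plausible reading, stated as an assumption: at scale factor 50 the pgbench dataset is well under 1 GB, so if this ran on the 64 GiB machine described earlier, select-only queries were likely served mostly from memory and barely touched either storage path.

```python
# Relative difference between the 9P and ext4 read-only (-S) runs.
# tps values are copied from the pgbench output above.

def percent_faster(tps_a: float, tps_b: float) -> float:
    """By what percentage tps_a exceeds tps_b."""
    return (tps_a / tps_b - 1.0) * 100.0


TPS_9P = 185652.686153
TPS_EXT4 = 182836.180428

if __name__ == "__main__":
    gap = percent_faster(TPS_9P, TPS_EXT4)
    print(f"9P vs ext4 (read-only): {gap:.2f}% difference")
```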

Best,
Pierre



* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2026-02-16 11:06  Pierre Barre <[email protected]>
  parent: Jeff Ross <[email protected]>
  1 sibling, 1 reply; 13+ messages in thread

From: Pierre Barre @ 2026-02-16 11:06 UTC (permalink / raw)
  To: Jeff Ross <[email protected]>; [email protected]

Hi all,

Circling back on this thread: ZeroFS now supports placing its WAL on local storage (or something like S3 Express One Zone). The ZeroFS WAL is sub-gigabyte and is only there to absorb frequent syncs; it doesn't act as a writeback cache.

Here are pgbench results with synchronous_commit = on, WAL on local NVMe, on a 6-core / 32GB RAM machine with a 4 Gb/s pipe:

$ pgbench -c 100 -T 100 --protocol=prepared

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: prepared
number of clients: 100
number of threads: 1
duration: 100 s
number of transactions actually processed: 1,578,675
number of failed transactions: 0 (0.000%)
latency average = 6.312 ms
tps = 15,843 (without initial connection time)
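
For a sense of scale, the sketch below compares this result with the synchronous_commit = on figures posted earlier in the thread. This is not a like-for-like comparison: the machine, scale factor (100 vs. 50), query protocol, and thread count all differ, so the ratios are only indicative. The labels and function name are illustrative.

```python
# Ratio of the new synchronous_commit = on result (local-NVMe WAL) to the
# earlier synchronous_commit = on runs. All tps values come from the thread.

NEW_TPS = 15843.0

earlier_sync_on_tps = {
    "2025-07-25, 9P, 50 clients": 252.878721,
    "2025-07-26, 32 kB blocks, 100 clients": 331.345391,
    "2025-07-26, NBD + ext4, 100 clients": 296.066616,
}


def speedup(new_tps: float, old_tps: float) -> float:
    """How many times higher the new throughput is."""
    return new_tps / old_tps


if __name__ == "__main__":
    for label, old_tps in earlier_sync_on_tps.items():
        print(f"{label}: {speedup(NEW_TPS, old_tps):.1f}x")
```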

Best,
Pierre


* Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
@ 2026-04-08 20:50  Pierre Barre <[email protected]>
  parent: Pierre Barre <[email protected]>
  0 siblings, 0 replies; 13+ messages in thread

From: Pierre Barre @ 2026-04-08 20:50 UTC (permalink / raw)
  To: [email protected]

Hi all,

Building on that, I made Postgres run in the browser using an x86 JavaScript emulator, with ZeroFS mounted in that VM through vsock-virtio, which channels back to a 9P WebSocket wrapper: https://www.zerofs.net/postgresql-in-the-browser

Best,
Pierre


Thread overview: 13+ messages
2025-07-18 10:57 Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance Pierre Barre <[email protected]>
2025-07-18 12:31 ` Pierre Barre <[email protected]>
2025-07-18 12:55 ` Seref Arikan <[email protected]>
2025-07-18 13:11   ` Pierre Barre <[email protected]>
2025-07-24 19:44 ` Nico Williams <[email protected]>
2025-07-24 19:50   ` Pierre Barre <[email protected]>
2025-07-24 22:03     ` Jeff Ross <[email protected]>
2025-07-24 22:44       ` Pierre Barre <[email protected]>
2025-07-25 09:25         ` Pierre Barre <[email protected]>
2025-07-26 01:16           ` Pierre Barre <[email protected]>
2025-07-26 01:22             ` Pierre Barre <[email protected]>
2026-02-16 11:06       ` Pierre Barre <[email protected]>
2026-04-08 20:50         ` Pierre Barre <[email protected]>
