Date: Fri, 18 Jul 2025 15:11:40 +0200
From: "Pierre Barre"
To: "Seref Arikan"
Cc: pgsql-general@lists.postgresql.org
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Message-Id: <350749d2-55ad-4566-bad2-93188fd23b7c@app.fastmail.com>
References: <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com>

> The interesting thing is, a few searches about the performance return mostly negative impressions about their object storage in comparison to the original S3.

I think they had a rough start, but it's quite good now from what I've experienced. It's also dirt-cheap, and they don't bill for operations. So if you run ZeroFS on that you only pay for raw storage at €4.99 a month.

Combine that with their dirt-cheap dedicated servers (https://www.hetzner.com/dedicated-rootserver/matrix-ax/) and you can have a multi-terabyte Postgres database for under €50 a month.

I'm dreaming of running https://www.merklemap.com/ on such a setup, but it's too early yet :)

> Finding out what kind of performance your benchmarks would yield on a pure AWS setting would be interesting. I am not asking you to do that, but you may get even better performance in that case :)

Yes, I need to try that!

Best,
Pierre

On Fri, Jul 18, 2025, at 14:55, Seref Arikan wrote:
> Thanks, I learned something else: I didn't know Hetzner offered S3-compatible storage.
>
> The interesting thing is, a few searches about the performance return mostly negative impressions about their object storage in comparison to the original S3.
>
> Finding out what kind of performance your benchmarks would yield on a pure AWS setting would be interesting. I am not asking you to do that, but you may get even better performance in that case :)
>
> Cheers,
> Seref
>
>
> On Fri, Jul 18, 2025 at 11:58 AM Pierre Barre <pierre@barre.sh> wrote:
>> Hi Seref,
>>
>> For the benchmarks, I used Hetzner's cloud service with the following setup:
>>
>> - A Hetzner S3 bucket in the FSN1 region
>> - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>> - 3 ZeroFS NBD devices (same S3 bucket)
>> - A ZFS striped pool with the 3 devices
>> - 200 GB ZFS L2ARC
>> - Postgres configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>
>> Best,
>> Pierre
>>
>> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>> Sorry, this was meant to go to the whole group:
>>>
>>> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>>
>>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre wrote:
>>>> Hi everyone,
>>>>
>>>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>>
>>>> ZeroFS: https://github.com/Barre/ZeroFS
>>>>
>>>> # The Architecture
>>>>
>>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>>
>>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>>
>>>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
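
One rough way to see how stacked caches can hide S3 latency is an expected-value model. The sketch below is illustrative only: the tier names follow the post, but the hit ratios and per-tier latencies are made-up assumptions, not measurements from ZeroFS or ZFS, and the model ignores the lookup cost paid on each miss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    hit_ratio: float   # fraction of requests reaching this tier that it serves
    latency_us: float  # assumed service time in microseconds (hypothetical)


def expected_latency_us(tiers):
    """Expected read latency when every miss falls through to the next tier."""
    reach = 1.0   # probability that a request gets as far as the current tier
    total = 0.0
    for tier in tiers:
        total += reach * tier.hit_ratio * tier.latency_us
        reach *= 1.0 - tier.hit_ratio
    if reach > 1e-12:
        raise ValueError("the last tier must serve everything (hit_ratio = 1.0)")
    return total


# Hypothetical numbers, for illustration only.
tiers = [
    Tier("ZFS ARC (RAM)", 0.90, 1.0),
    Tier("ZFS L2ARC (local SSD)", 0.90, 100.0),
    Tier("ZeroFS cache", 0.50, 500.0),
    Tier("S3", 1.00, 50_000.0),
]

print(f"expected read latency: {expected_latency_us(tiers):.1f} us")
```

With these assumed values only 0.5% of reads reach S3, yet those reads still account for most of the average, so the result is very sensitive to the hit ratios of the upper tiers.
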
>>>>
>>>> ## Performance Results
>>>>
>>>> Here are pgbench results from PostgreSQL running on this setup:
>>>>
>>>> ### Read/Write Workload
>>>>
>>>> ```
>>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> starting vacuum...end.
>>>> transaction type: <builtin: TPC-B (sort of)>
>>>> scaling factor: 50
>>>> query mode: simple
>>>> number of clients: 50
>>>> number of threads: 15
>>>> maximum number of tries: 1
>>>> number of transactions per client: 100000
>>>> number of transactions actually processed: 5000000/5000000
>>>> number of failed transactions: 0 (0.000%)
>>>> latency average = 0.943 ms
>>>> initial connection time = 48.043 ms
>>>> tps = 53041.006947 (without initial connection time)
>>>> ```
>>>>
>>>> ### Read-Only Workload
>>>>
>>>> ```
>>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> starting vacuum...end.
>>>> transaction type: <builtin: select only>
>>>> scaling factor: 50
>>>> query mode: simple
>>>> number of clients: 50
>>>> number of threads: 15
>>>> maximum number of tries: 1
>>>> number of transactions per client: 100000
>>>> number of transactions actually processed: 5000000/5000000
>>>> number of failed transactions: 0 (0.000%)
>>>> latency average = 0.121 ms
>>>> initial connection time = 53.358 ms
>>>> tps = 413436.248089 (without initial connection time)
>>>> ```
>>>>
>>>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>>>>
>>>> ## How It Works
>>>>
>>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>>> 2. Multiple cache layers hide S3 latency:
>>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>>    b. ZeroFS memory cache for metadata and hot data
>>>>    c. Optional local disk cache
>>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>>>
>>>> ## Geo-Distributed PostgreSQL
>>>>
>>>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>>
>>>> Example architectures:
>>>>
>>>> Architecture 1:
>>>>
>>>>                          PostgreSQL Client
>>>>                                    |
>>>>                                    | SQL queries
>>>>                                    |
>>>>                             +--------------+
>>>>                             |  PG Proxy    |
>>>>                             | (HAProxy/    |
>>>>                             |  PgBouncer)  |
>>>>                             +--------------+
>>>>                                /        \
>>>>                               /          \
>>>>                    Synchronous            Synchronous
>>>>                    Replication            Replication
>>>>                             /              \
>>>>                            /                \
>>>>               +---------------+        +---------------+
>>>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>>               | (Primary)     |◄------►| (Standby)     |
>>>>               +---------------+        +---------------+
>>>>                       |                        |
>>>>                       |  POSIX filesystem ops  |
>>>>                       |                        |
>>>>               +---------------+        +---------------+
>>>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>>               | (3-way mirror)|        | (3-way mirror)|
>>>>               +---------------+        +---------------+
>>>>                /      |      \          /      |      \
>>>>               /       |       \        /       |       \
>>>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>>>              |        |        |           |        |        |
>>>>         +--------++--------++--------++--------++--------++--------+
>>>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>>         +--------++--------++--------++--------++--------++--------+
>>>>              |         |         |         |         |         |
>>>>              |         |         |         |         |         |
>>>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>>>>
>>>> Architecture 2:
>>>>
>>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>>                 \                    /
>>>>                  \                  /
>>>>                   Same ZFS Pool (NBD)
>>>>                          |
>>>>                   6 Global ZeroFS
>>>>                          |
>>>>                       S3 Regions
>>>>
>>>>
>>>> The main advantages I see are:
>>>>
>>>> 1. Dramatic cost reduction for large datasets
>>>> 2. Simplified geo-distribution
>>>> 3. Effectively unlimited storage capacity
>>>> 4. Built-in encryption and compression
>>>>
>>>> Looking forward to your feedback and questions!
>>>>
>>>> Best,
>>>> Pierre
>>>>
>>>> P.S. The full project includes a custom NFS filesystem too.
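
As a quick sanity check on the pgbench figures quoted above: with a fixed number of clients that each run one transaction at a time, throughput should be close to clients divided by average latency. The sketch below only redoes that arithmetic on the reported numbers; the helper name `implied_tps` is mine, and the small gap to the reported tps is assumed to come from the latency being rounded to three decimals and from connection time.

```python
def implied_tps(clients: int, latency_ms: float) -> float:
    """Throughput implied by N busy clients and an average latency per transaction."""
    return clients / (latency_ms / 1000.0)


# (clients, reported latency average in ms, reported tps), copied from the runs above
runs = {
    "read/write": (50, 0.943, 53041.006947),
    "read-only": (50, 0.121, 413436.248089),
}

for name, (clients, latency_ms, reported_tps) in runs.items():
    implied = implied_tps(clients, latency_ms)
    diff_pct = abs(implied - reported_tps) / reported_tps * 100.0
    print(f"{name}: implied {implied:.0f} tps, reported {reported_tps:.0f} tps, "
          f"difference {diff_pct:.3f}%")
```

Both reported throughput figures land within about 0.1% of clients / latency, so the latency and tps lines of each run are consistent with each other.
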
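
To make the 128KB chunking from the "How It Works" list concrete, here is a minimal sketch of how a byte range could be mapped onto fixed-size chunks. It is not ZeroFS's actual code (the real key layout in its LSM-tree is not described in this thread); the function name `chunk_spans` and the tuple layout are assumptions for illustration.

```python
CHUNK_SIZE = 128 * 1024  # 128 KiB, the chunk size mentioned in the post


def chunk_spans(offset: int, length: int, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_index, start_within_chunk, n_bytes) covering a byte range."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pos = offset
    end = offset + length
    while pos < end:
        index, start = divmod(pos, chunk_size)
        n_bytes = min(chunk_size - start, end - pos)
        yield index, start, n_bytes
        pos += n_bytes


# An 8 KiB write (the default PostgreSQL page size) that straddles a chunk boundary.
for span in chunk_spans(offset=130_000, length=8192):
    print(span)
```

A write that crosses a chunk boundary touches two chunks, which is one reason a block layer in front of such a store benefits from caching and write coalescing.
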
