Date: Fri, 25 Jul 2025 00:31:58 +0200
From: "Pierre Barre"
To: "Marco Torres", pgsql-general@lists.postgresql.org
Message-Id: <55a8a36a-eb10-44e7-adca-12c30ae1254d@app.fastmail.com>
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Hi Marco,

Thanks for the kind words!

> and potentially elaborate on other projects for an active/active cluster! I applaud you.

I wrote an argument there: https://github.com/Barre/ZeroFS?tab=readme-ov-file#cap-theorem

I definitely want to write a proof of concept when I get some time.

Best,
Pierre

On Fri, Jul 25, 2025, at 00:21, Marco Torres wrote:
> My humble take on this project: well done! You are opening the doors to work on a much-needed endeavor, decouple compute from storage, and potentially elaborate on other projects for an active/active cluster! I applaud you.

On Thu, Jul 17, 2025, 4:59 PM Pierre Barre <pierre@barre.sh> wrote:

Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
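
To make the caching argument concrete, here is a minimal sketch of how average read latency falls out of a tiered cache, where each miss falls through to the next layer. The tier names follow the stack above, but every hit rate and latency figure is a hypothetical value chosen for illustration, not a ZeroFS measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    hit_rate: float    # fraction of requests reaching this tier that it serves
    latency_us: float  # assumed time to serve one read from this tier


def expected_latency_us(tiers):
    """Average read latency when each miss falls through to the next tier."""
    total = 0.0
    reach = 1.0  # probability that a request gets as far as the current tier
    for tier in tiers:
        total += reach * tier.hit_rate * tier.latency_us
        reach *= 1.0 - tier.hit_rate
    return total


# Hypothetical figures, for illustration only.
tiers = [
    Tier("ZFS ARC (RAM)", hit_rate=0.90, latency_us=1.0),
    Tier("ZFS L2ARC (local SSD)", hit_rate=0.90, latency_us=100.0),
    Tier("S3", hit_rate=1.00, latency_us=50_000.0),
]
print(f"{expected_latency_us(tiers):.1f} us")
```

With these made-up numbers, the 1% of reads that reach S3 contribute most of the average, which is why the cache hit rate matters far more than the raw S3 latency.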

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
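
As a sanity check on the two runs, the reported average latency is consistent with dividing the number of clients by the throughput, which is what one would expect from a closed-loop benchmark where every client always has one transaction in flight. The sketch below only redoes that arithmetic with the figures from the output above; the function name is illustrative.

```python
def avg_latency_ms(clients, tps):
    """Average per-transaction latency implied by clients / throughput."""
    return clients / tps * 1000.0


# (clients, tps) taken from the two pgbench runs above.
runs = {
    "read/write": (50, 53041.006947),
    "read-only": (50, 413436.248089),
}
for name, (clients, tps) in runs.items():
    print(f"{name}: {avg_latency_ms(clients, tps):.3f} ms")
```

This prints 0.943 ms and 0.121 ms, matching the `latency average` lines in the two runs.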

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
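
Step 4 can be illustrated with plain offset arithmetic. The sketch below maps a byte range onto fixed 128KB chunks. How ZeroFS actually keys chunks in its LSM-tree is not described here, so the tuple layout and function name are assumptions made for the example.

```python
CHUNK_SIZE = 128 * 1024  # 128KB, the chunk size stated in step 4


def chunk_spans(offset, length, chunk_size=CHUNK_SIZE):
    """Yield (chunk_index, offset_in_chunk, span_length) covering a byte range."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    end = offset + length
    while offset < end:
        index, within = divmod(offset, chunk_size)
        # Stop at the chunk boundary or at the end of the request.
        span = min(chunk_size - within, end - offset)
        yield index, within, span
        offset += span


# A 200KB read starting 100KB into a file touches three chunks.
for span in chunk_spans(100 * 1024, 200 * 1024):
    print(span)
```

Each yielded tuple would correspond to one chunk lookup, so a read that straddles chunk boundaries costs one lookup per chunk touched.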

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.

Example architectures:

Architecture 1


                         PostgreSQL Client
                                   |
                                   | SQL queries
                                   |
                            +--------------+
                            |  PG Proxy    |
                            | (HAProxy/    |
                            |  PgBouncer)  |
                            +--------------+
                               /        \
                              /          \
                   Synchronous            Synchronous
                   Replication            Replication
                            /              \
                           /                \
              +---------------+        +---------------+
              | PostgreSQL 1  |        | PostgreSQL 2  |
              | (Primary)     |◄------►| (Standby)     |
              +---------------+        +---------------+
                      |                        |
                      |  POSIX filesystem ops  |
                      |                        |
              +---------------+        +---------------+
              |   ZFS Pool 1  |        |   ZFS Pool 2  |
              | (3-way mirror)|        | (3-way mirror)|
              +---------------+        +---------------+
               /      |      \          /      |      \
              /       |       \        /       |       \
        NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
             |        |        |           |        |        |
        +--------++--------++--------++--------++--------++--------+
        |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
        +--------++--------++--------++--------++--------++--------+
             |         |         |         |         |         |
             |         |         |         |         |         |
        S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
        (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
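
The storage half of Architecture 1 can be written down as data: each PostgreSQL host builds one 3-way ZFS mirror from three NBD devices, each backed by a ZeroFS instance in a different region. The sketch below only prints the `zpool create` command each host might run. The port-to-region pairing is read off the diagram, while the pool names and the /dev/nbdN numbering are hypothetical.

```python
import shlex

# Port-to-region pairing as read off the diagram above. Pool names and the
# /dev/nbdN numbering are hypothetical; each pool lives on its own host.
POOLS = {
    "pgpool1": [(10809, "us-east"), (10810, "eu-west"), (10811, "ap-south")],
    "pgpool2": [(10812, "us-west"), (10813, "eu-north"), (10814, "ap-east")],
}


def zpool_create_command(pool, devices):
    """Build a `zpool create` command line for a single N-way mirror vdev."""
    if len(devices) < 2:
        raise ValueError("a mirror needs at least two devices")
    return shlex.join(["zpool", "create", pool, "mirror", *devices])


def tolerated_failures(mirror_width):
    """A mirror stays readable as long as one member survives."""
    return mirror_width - 1


for pool, members in POOLS.items():
    devices = [f"/dev/nbd{i}" for i in range(len(members))]
    for device, (port, region) in zip(devices, members):
        print(f"{device} <- NBD port {port} ({region})")
    # The command is printed, not executed.
    print(zpool_create_command(pool, devices))
    print(f"tolerates loss of {tolerated_failures(len(devices))} of {len(devices)} members")
```

Under these assumptions, each pool keeps serving data with any two of its three regions unreachable, at the cost of writing every block three times.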

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
                \                    /
                 \                  /
                   Same ZFS Pool (NBD)
                          |
                   6 Global ZeroFS
                          |
                      S3 Regions


The main advantages I see are:
1. Dramatic cost reduction for large datasets
2. Simplified geo-distribution
3. Effectively unlimited storage capacity
4. Built-in encryption and compression

Looking forward to your feedback and questions!

Best,
Pierre

P.S. The full project includes a custom NFS filesystem too.

