From: "Pierre Barre"
To: "Seref Arikan"
Cc: pgsql-general@lists.postgresql.org
Date: Fri, 18 Jul 2025 12:57:39 +0200
Message-Id: <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com>
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following setup:

- A Hetzner S3 bucket in the FSN1 region
- A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
- 3 ZeroFS NBD devices (same S3 bucket)
- A ZFS striped pool with the 3 devices
- 200 GB ZFS L2ARC
- Postgres configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
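
For anyone wanting to reproduce the WAL-related part of this configuration, the three settings named above could be written out as a postgresql.conf fragment, sketched here in Python. This is only an illustration: the helper name `render_conf` is hypothetical, and the memory settings are omitted because their exact values are not given in this message.

```python
# Settings taken from the list above; memory-related values are
# intentionally left out because they are not specified here.
BENCHMARK_SETTINGS = {
    "synchronous_commit": "off",  # commits return without waiting for the WAL flush
    "wal_init_zero": "off",       # do not zero-fill newly created WAL files
    "wal_recycle": "off",         # create new WAL files instead of renaming old ones
}


def render_conf(settings: dict[str, str]) -> str:
    """Render settings as 'name = value' lines, one per setting."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items()) + "\n"


if __name__ == "__main__":
    # Prints a fragment that could be appended to postgresql.conf.
    print(render_conf(BENCHMARK_SETTINGS), end="")
```

Note that synchronous_commit = off trades durability of the most recent commits for throughput: a crash can lose the last few transactions, so results obtained this way are not directly comparable to a fully synchronous setup.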

Best,
Pierre

On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
Sorry, this was meant to go to the whole group:

Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?

On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers were measured with 50 concurrent clients, with the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
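
As a sanity check on the two runs above, the reported average latency is consistent with the throughput: with a fixed number of clients each running one transaction at a time, average latency is roughly clients / tps. The sketch below redoes that arithmetic with the figures from the pgbench output. The function names are illustrative, and the relation is an approximation rather than pgbench's exact internal calculation.

```python
def latency_ms(clients: int, tps: float) -> float:
    """Approximate average latency in ms for closed-loop clients (clients / tps)."""
    return clients / tps * 1000.0


def run_seconds(total_transactions: int, tps: float) -> float:
    """Approximate duration of the run in seconds, excluding connection setup."""
    return total_transactions / tps


if __name__ == "__main__":
    clients = 50
    total = clients * 100_000  # -c 50 -t 100000
    runs = [("read/write", 53041.006947), ("read-only", 413436.248089)]
    for label, tps in runs:
        print(f"{label}: {latency_ms(clients, tps):.3f} ms, "
              f"{run_seconds(total, tps):.1f} s")
```

This gives about 0.943 ms and 0.121 ms, matching the averages pgbench reported, and implies the read/write run lasted roughly a minute and a half.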

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128 KB chunks for insertion into ZeroFS's LSM-tree
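
The layering in steps 2 and 4 can be illustrated with a toy read path in Python: data is split into 128 KB chunks, and each read tries the cache tiers from fastest to slowest before falling back to the object store. This is a simplified sketch, not ZeroFS's or ZFS's actual code. The class names are made up, a plain LRU policy stands in for the real eviction policies (ZFS ARC is adaptive, not pure LRU), and a dictionary stands in for S3.

```python
from __future__ import annotations

from collections import OrderedDict

CHUNK_SIZE = 128 * 1024  # 128 KB, the chunk size mentioned in step 4


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> dict[int, bytes]:
    """Split data into fixed-size chunks keyed by chunk index."""
    return {
        offset // chunk_size: data[offset:offset + chunk_size]
        for offset in range(0, len(data), chunk_size)
    }


class LRUTier:
    """A bounded cache tier; plain LRU stands in for the real eviction policies."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self._items: OrderedDict[int, bytes] = OrderedDict()

    def get(self, key: int) -> bytes | None:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: int, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


class TieredReader:
    """Look up a chunk in each tier, fastest first, then fall back to the store."""

    def __init__(self, tiers: list[LRUTier], backing_store: dict[int, bytes]) -> None:
        self.tiers = tiers
        self.backing_store = backing_store  # hypothetical stand-in for S3

    def read(self, key: int) -> tuple[bytes, str]:
        """Return the chunk and the name of the layer that served it."""
        for depth, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                for faster in self.tiers[:depth]:  # promote into faster tiers
                    faster.put(key, value)
                return value, tier.name
        value = self.backing_store[key]  # cold read from the object store
        for tier in self.tiers:
            tier.put(key, value)
        return value, "s3"


if __name__ == "__main__":
    store = split_into_chunks(bytes(300 * 1024))  # 3 chunks: 128 KB, 128 KB, 44 KB
    reader = TieredReader([LRUTier("memory", 2), LRUTier("disk", 8)], store)
    for chunk_index in (0, 0, 1, 2, 0):
        _, served_by = reader.read(chunk_index)
        print(chunk_index, served_by)
```

In the demo, the first read of a chunk is served from the store, repeated reads are served from the memory tier, and a chunk evicted from memory is still found in the larger disk tier, which is the effect the cache hierarchy above relies on.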

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.

Example architectures:

Architecture 1:

                         PostgreSQL Client
                                  |
                                  | SQL queries
                                  |
                           +--------------+
                           |  PG Proxy    |
                           | (HAProxy/    |
                           |  PgBouncer)  |
                           +--------------+
                                /        \
                               /          \
                    Synchronous            Synchronous
                    Replication            Replication
                             /              \
                            /                \
               +---------------+        +---------------+
               | PostgreSQL 1  |        | PostgreSQL 2  |
               | (Primary)     |◄------►| (Standby)     |
               +---------------+        +---------------+
                       |                        |
                       |  POSIX filesystem ops  |
                       |                        |
               +---------------+        +---------------+
               |   ZFS Pool 1  |        |   ZFS Pool 2  |
               | (3-way mirror)|        | (3-way mirror)|
               +---------------+        +---------------+
                /      |      \          /      |      \
               /       |       \        /       |       \
         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
              |        |        |           |        |        |
         +--------++--------++--------++--------++--------++--------+
         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
         +--------++--------++--------++--------++--------++--------+
              |         |         |         |         |         |
              |         |         |         |         |         |
         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
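
To make the 3-way mirror in Architecture 1 more concrete, here is a toy Python model of mirror semantics: every write goes to all online devices, and a read succeeds as long as at least one device holding the block is still reachable. It is a simplified illustration, not how ZFS or ZeroFS is implemented. The class name is made up, the device names are borrowed from the region labels in the diagram, and resilvering of a returning device is not modelled.

```python
class MirrorPool:
    """Toy n-way mirror: each online device holds a full copy of every block."""

    def __init__(self, device_names: list[str]) -> None:
        self.blocks: dict[str, dict[int, bytes]] = {name: {} for name in device_names}
        self.online: set[str] = set(device_names)

    def fail(self, name: str) -> None:
        """Mark a device (for example, one region) as unreachable."""
        self.online.discard(name)

    def write(self, block_no: int, data: bytes) -> None:
        """Write the block to every device that is currently online."""
        if not self.online:
            raise OSError("no online devices in mirror")
        for name in self.online:
            self.blocks[name][block_no] = data

    def read(self, block_no: int) -> bytes:
        """Return the block from the first online device that has it."""
        for name in sorted(self.online):
            if block_no in self.blocks[name]:
                return self.blocks[name][block_no]
        raise OSError(f"block {block_no} is not available on any online device")


if __name__ == "__main__":
    pool = MirrorPool(["us-east", "eu-west", "ap-south"])
    pool.write(0, b"page data")
    pool.fail("us-east")
    pool.fail("eu-west")
    print(pool.read(0))  # still readable from the one remaining device
```

The trade-off is that each pool stores three full copies, so the redundancy across regions comes at three times the storage and write traffic of a single device.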

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
                \                    /
                 \                  /
                   Same ZFS Pool (NBD)
                          |
                   6 Global ZeroFS
                          |
                      S3 Regions


The main advantages I see are:
1. Dramatic cost reduction for large datasets
2. Simplified geo-distribution
3. Effectively unlimited storage capacity
4. Built-in encryption and compression

Looking forward to your feedback and questions!

Best,
Pierre

P.S. The full project includes a custom NFS filesystem too.

