MIME-Version: 1.0
Date: Sat, 26 Jul 2025 09:51:15 +0200
From: "Pierre Barre"
To: "Vladimir Churyukin"
Cc: pgsql-general@lists.postgresql.org
Message-Id: <44dafe90-9ad6-41ae-b9fe-bea4aaf49a59@app.fastmail.com>
In-Reply-To:
References: <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com>
<96edd171-9cbe-466d-b3d6-04e069cee419@app.fastmail.com> Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance Content-Type: multipart/alternative; boundary=d5aa1bab362c4f96945cfcf1af96a23b List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --d5aa1bab362c4f96945cfcf1af96a23b Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Ah, by "shared storage" I mean that each node can acquire exclusivity, n= ot that they can both R/W to it at the same time. > Some pretty well-known cases of storage / compute separation (Aurora, = Neon) also share the storage between instances, That model is cool, but I think it's more of a solution for outliers as = I was suggesting, not something that most would or should want. Best, Pierre On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote: > Sorry, I was referring to this: >=20 > > But when PostgreSQL instances share storage rather than replicate: > > - Consistency seems maintained (same data) > > - Availability seems maintained (client can always promote an access= ible node) > > - Partitions between PostgreSQL nodes don't prevent the system from = functioning >=20 > Some pretty well-known cases of storage / compute separation (Aurora, = Neon) also share the storage between instances, > that's why I'm a bit confused by your reply. I thought you're thinking= about this approach too, that's why I mentioned what kind of challenges= one may have on that path. >=20 >=20 > On Sat, Jul 26, 2025 at 12:36=E2=80=AFAM Pierre Barre wrote: >> __ >> What you describe doesn=E2=80=99t look like something very useful for= the vast majority of projects that needs a database. 
Why would you even= want that if you can avoid it?=20 >>=20 >> If your =E2=80=9Csingle node=E2=80=9D can handle tens / hundreds of t= housands requests per second, still have very durable and highly availab= le storage, as well as fast recovery mechanisms, what=E2=80=99s the poin= t? >>=20 >> I am not trying to cater to extreme outliers that may want very weird= like this, that=E2=80=99s just not the use-cases I want to address, bec= ause I believe they are few and far between. >>=20 >> Best, >> Pierre=20 >>=20 >> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote: >>> A shared storage would require a lot of extra work. That's essential= ly what AWS Aurora does. >>> You will have to have functionality to sync in-memory states between= nodes, because all the instances will have cached data that can easily = become stale on any write operation. >>> That alone is not that simple. You will have to modify some locking = logic. Most likely do a lot of other changes in a lot of places, Postgre= s was not just built with the assumption that the storage can be shared. >>>=20 >>> -Vladimir >>>=20 >>> On Fri, Jul 18, 2025 at 5:31=E2=80=AFAM Pierre Barre wrote: >>>> Now, I'm trying to understand how CAP theorem applies here. Traditi= onal PostgreSQL replication has clear CAP trade-offs - you choose betwee= n consistency and availability during partitions. >>>>=20 >>>> But when PostgreSQL instances share storage rather than replicate: >>>> - Consistency seems maintained (same data) >>>> - Availability seems maintained (client can always promote an acces= sible node) >>>> - Partitions between PostgreSQL nodes don't prevent the system from= functioning >>>>=20 >>>> It seems that CAP assumes specific implementation details (like nod= es maintaining independent state) without explicitly stating them. >>>>=20 >>>> How should we think about CAP theorem when distributed nodes share = storage rather than coordinate state? 
Are the trade-offs simply moved to= a different layer, or does shared storage fundamentally change the anal= ysis? >>>>=20 >>>> Client with awareness of both PostgreSQL nodes >>>> | | >>>> =E2=86=93 (partition here) =E2=86=93 >>>> PostgreSQL Primary PostgreSQL Standby >>>> | | >>>> =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=98 >>>> =E2=86=93 >>>> Shared ZFS Pool >>>> | >>>> 6 Global ZeroFS instances >>>>=20 >>>> Best, >>>> Pierre >>>>=20 >>>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote: >>>> > Hi Seref, >>>> > >>>> > For the benchmarks, I used Hetzner's cloud service with the follo= wing setup: >>>> > >>>> > - A Hetzner s3 bucket in the FSN1 region >>>> > - A virtual machine of type ccx63 48 vCPU 192 GB memory >>>> > - 3 ZeroFS nbd devices (same s3 bucket) >>>> > - A ZFS stripped pool with the 3 devices >>>> > - 200GB zfs L2ARC >>>> > - Postgres configured accordingly memory-wise as well as with syn= chronous_commit =3D off, wal_init_zero =3D off and wal_recycle =3D off. >>>> > >>>> > Best, >>>> > Pierre >>>> > >>>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote: >>>> >> Sorry, this was meant to go to the whole group: >>>> >> >>>> >> Very interesting!. Great work. Can you clarify how exactly you'r= e running postgres in your tests? A specific AWS service? What's the tes= t infrastructure that sits above the file system? >>>> >> >>>> >> On Thu, Jul 17, 2025 at 11:59=E2=80=AFPM Pierre Barre wrote: >>>> >>> Hi everyone, >>>> >>> >>>> >>> I wanted to share a project I've been working on that enables P= ostgreSQL to run on S3 storage while maintaining performance comparable = to local NVMe. The approach uses block-level access rather than trying t= o map filesystem operations to S3 objects. 
>>>> >>> >>>> >>> ZeroFS: https://github.com/Barre/ZeroFS >>>> >>> >>>> >>> # The Architecture >>>> >>> >>>> >>> ZeroFS provides NBD (Network Block Device) servers that expose = S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools= built on these block devices: >>>> >>> >>>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3 >>>> >>> >>>> >>> By providing block-level access and leveraging ZFS's caching ca= pabilities (L2ARC), we can achieve microsecond latencies despite the und= erlying storage being in S3. >>>> >>> >>>> >>> ## Performance Results >>>> >>> >>>> >>> Here are pgbench results from PostgreSQL running on this setup: >>>> >>> >>>> >>> ### Read/Write Workload >>>> >>> >>>> >>> ``` >>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10000= 0 example >>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1)) >>>> >>> starting vacuum...end. >>>> >>> transaction type: >>>> >>> scaling factor: 50 >>>> >>> query mode: simple >>>> >>> number of clients: 50 >>>> >>> number of threads: 15 >>>> >>> maximum number of tries: 1 >>>> >>> number of transactions per client: 100000 >>>> >>> number of transactions actually processed: 5000000/5000000 >>>> >>> number of failed transactions: 0 (0.000%) >>>> >>> latency average =3D 0.943 ms >>>> >>> initial connection time =3D 48.043 ms >>>> >>> tps =3D 53041.006947 (without initial connection time) >>>> >>> ``` >>>> >>> >>>> >>> ### Read-Only Workload >>>> >>> >>>> >>> ``` >>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10000= 0 -S example >>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1)) >>>> >>> starting vacuum...end. 
>>>> >>> transaction type: >>>> >>> scaling factor: 50 >>>> >>> query mode: simple >>>> >>> number of clients: 50 >>>> >>> number of threads: 15 >>>> >>> maximum number of tries: 1 >>>> >>> number of transactions per client: 100000 >>>> >>> number of transactions actually processed: 5000000/5000000 >>>> >>> number of failed transactions: 0 (0.000%) >>>> >>> latency average =3D 0.121 ms >>>> >>> initial connection time =3D 53.358 ms >>>> >>> tps =3D 413436.248089 (without initial connection time) >>>> >>> ``` >>>> >>> >>>> >>> These numbers are with 50 concurrent clients and the actual dat= a stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory ca= ches, while cold data comes from S3. >>>> >>> >>>> >>> ## How It Works >>>> >>> >>>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL= /ZFS can use like any other block device >>>> >>> 2. Multiple cache layers hide S3 latency: >>>> >>> a. ZFS ARC/L2ARC for frequently accessed blocks >>>> >>> b. ZeroFS memory cache for metadata and hot dataZeroFS expos= es NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any ot= her block device >>>> >>> c. Optional local disk cache >>>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3 >>>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS'= LSM-tree >>>> >>> >>>> >>> ## Geo-Distributed PostgreSQL >>>> >>> >>>> >>> Since each region can run its own ZeroFS instance, you can crea= te geographically distributed PostgreSQL setups. 
>>>> >>> >>>> >>> Example architectures: >>>> >>> >>>> >>> Architecture 1 >>>> >>> >>>> >>> >>>> >>> PostgreSQL Client >>>> >>> | >>>> >>> | SQL queries >>>> >>> | >>>> >>> +--------------+ >>>> >>> | PG Proxy | >>>> >>> | (HAProxy/ | >>>> >>> | PgBouncer) | >>>> >>> +--------------+ >>>> >>> / \ >>>> >>> / \ >>>> >>> Synchronous Synchronous >>>> >>> Replication Replication >>>> >>> / \ >>>> >>> / \ >>>> >>> +---------------+ +---------------+ >>>> >>> | PostgreSQL 1 | | PostgreSQL 2 | >>>> >>> | (Primary) |=E2=97=84------=E2=96=BA| (Stand= by) | >>>> >>> +---------------+ +---------------+ >>>> >>> | | >>>> >>> | POSIX filesystem ops | >>>> >>> | | >>>> >>> +---------------+ +---------------+ >>>> >>> | ZFS Pool 1 | | ZFS Pool 2 | >>>> >>> | (3-way mirror)| | (3-way mirror)| >>>> >>> +---------------+ +---------------+ >>>> >>> / | \ / | \ >>>> >>> / | \ / | \ >>>> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:= 10814 >>>> >>> | | | | | | >>>> >>> +--------++--------++--------++--------++--------++----= ----+ >>>> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||Zero= FS 6| >>>> >>> +--------++--------++--------++--------++--------++----= ----+ >>>> >>> | | | | | | >>>> >>> | | | | | | >>>> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 = S3-Region6 >>>> >>> (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap= -east) >>>> >>> >>>> >>> Architecture 2: >>>> >>> >>>> >>> PostgreSQL Primary (Region 1) =E2=86=90=E2=86=92 PostgreSQL Sta= ndby (Region 2) >>>> >>> \ / >>>> >>> \ / >>>> >>> Same ZFS Pool (NBD) >>>> >>> | >>>> >>> 6 Global ZeroFS >>>> >>> | >>>> >>> S3 Regions >>>> >>> >>>> >>> >>>> >>> The main advantages I see are: >>>> >>> 1. Dramatic cost reduction for large datasets >>>> >>> 2. Simplified geo-distribution >>>> >>> 3. Infinite storage capacity >>>> >>> 4. Built-in encryption and compression >>>> >>> >>>> >>> Looking forward to your feedback and questions! >>>> >>> >>>> >>> Best, >>>> >>> Pierre >>>> >>> >>>> >>> P.S. 
The full project includes a custom NFS filesystem too. >>>> >>> >>>> > >>>>=20 >>=20 --d5aa1bab362c4f96945cfcf1af96a23b Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable
Ah, by "shared storage" I mean that each node can acquire exclusivity, not that they can both R/W to it at the same time.
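
One common way to picture "each node can acquire exclusivity" is a lease combined with a fencing token: only one node holds the lease at a time, and the storage side rejects writes that carry an older token. The sketch below is purely illustrative and in-memory. `LeaseManager`, `FencedStorage` and the token scheme are hypothetical names, not ZeroFS or PostgreSQL APIs, and nothing here describes how ZeroFS actually enforces exclusivity.

```python
"""Illustrative single-writer exclusivity via a lease plus fencing tokens.

All names are hypothetical; this is a sketch of the general idea only.
"""


class LeaseManager:
    """Hands the lease to one node at a time, with an increasing token."""

    def __init__(self):
        self.holder = None
        self.token = 0

    def acquire(self, node):
        # Promotion: the new holder gets a strictly larger token,
        # which implicitly invalidates the previous holder.
        self.token += 1
        self.holder = node
        return self.token


class FencedStorage:
    """Accepts a write only if its token is not older than the newest seen."""

    def __init__(self):
        self.highest_token = 0
        self.blocks = {}

    def write(self, token, block_id, data):
        if token < self.highest_token:
            raise PermissionError("stale token: node no longer holds the lease")
        self.highest_token = token
        self.blocks[block_id] = data


def demo():
    leases = LeaseManager()
    storage = FencedStorage()

    primary_token = leases.acquire("primary")
    storage.write(primary_token, 0, b"written by primary")

    # The client cannot reach the primary and promotes the standby.
    standby_token = leases.acquire("standby")
    storage.write(standby_token, 0, b"written by standby")

    # The old primary comes back and tries to write with its old token.
    try:
        storage.write(primary_token, 0, b"late write from old primary")
        rejected = False
    except PermissionError:
        rejected = True
    return rejected, storage.blocks[0]
```

Calling `demo()` shows the late write from the old primary being refused while the standby's data stays in place.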

> Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances,

That model is cool, but I think it's more of a solution for outliers as I was suggesting, not something that most would or should want.

Best,
Pierre

On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
Sorry, I was referring to this:

> But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning

Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances,
that's why I'm a bit confused by your reply. I thought you were thinking about this approach too, which is why I mentioned the kind of challenges one may face on that path.


On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pierre@barre.sh> wrote:
What you describe doesn’t look like something very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?

If your “single node” can handle tens or hundreds of thousands of requests per second while still having very durable and highly available storage, as well as fast recovery mechanisms, what’s the point?

I am not trying to cater to extreme outliers that may want something very weird like this; those are just not the use cases I want to address, because I believe they are few and far between.

Best,
Pierre 

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
Shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
You will need functionality to sync in-memory state between nodes, because all the instances will have cached data that can easily become stale on any write operation.
That alone is not that simple. You will have to modify some locking logic, and most likely make a lot of other changes in a lot of places; Postgres was just not built with the assumption that the storage can be shared.

-Vladimir

On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:
Now, I'm trying to understand how CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary             PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
>
> For the benchmarks, I used Hetzner's cloud service with the following setup:
>
> - A Hetzner S3 bucket in the FSN1 region
> - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
> - 3 ZeroFS NBD devices (same S3 bucket)
> - A ZFS striped pool with the 3 devices
> - 200 GB ZFS L2ARC
> - Postgres configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>>
>> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
>>> Hi everyone,
>>>
>>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>
>>> ZeroFS: https://github.com/Barre/ZeroFS
>>>
>>> # The Architecture
>>>
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>
>>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
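
To make the caching argument concrete, average read latency can be approximated as a hit-ratio-weighted mix of cache latency and S3 latency. The figures used below (100 µs for a cache hit, 20 ms for an S3 read, and the hit ratios) are assumptions chosen only for illustration; they are not measurements from this setup.

```python
def effective_latency_us(hit_ratio, cache_latency_us, s3_latency_us):
    """Average latency when a fraction hit_ratio of reads is served from cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_latency_us + (1.0 - hit_ratio) * s3_latency_us


# Assumed figures, for illustration only.
CACHE_LATENCY_US = 100.0    # hypothetical L2ARC / memory cache hit
S3_LATENCY_US = 20_000.0    # hypothetical S3 GET round trip

for hit_ratio in (0.90, 0.99, 0.999):
    avg = effective_latency_us(hit_ratio, CACHE_LATENCY_US, S3_LATENCY_US)
    print(f"hit ratio {hit_ratio:.3f}: ~{avg:.0f} us average")
```

Under these assumed numbers the average only stays close to cache latency while the hit ratio is very high, which is why the cache layers carry most of the weight in this design.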
>>>
>>> ## Performance Results
>>>
>>> Here are pgbench results from PostgreSQL running on this setup:
>>>
>>> ### Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: select only>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>>
>>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
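
As a sanity check, the reported throughput is consistent with the reported latency: with a fixed number of clients each running one transaction at a time, tps is roughly the client count divided by the average latency. The sketch below uses only the figures from the two pgbench runs; the helper name is made up for this example.

```python
def expected_tps(clients, latency_ms):
    """Throughput implied by N closed-loop clients at a given average latency."""
    return clients / (latency_ms / 1000.0)


runs = {
    "read/write": {"clients": 50, "latency_ms": 0.943, "reported_tps": 53041.006947},
    "read-only": {"clients": 50, "latency_ms": 0.121, "reported_tps": 413436.248089},
}

for name, run in runs.items():
    implied = expected_tps(run["clients"], run["latency_ms"])
    diff_pct = abs(implied - run["reported_tps"]) / run["reported_tps"] * 100
    print(f"{name}: implied {implied:.0f} tps, "
          f"reported {run['reported_tps']:.0f} tps ({diff_pct:.2f}% apart)")
```

The small gap that remains is expected, since pgbench prints the average latency rounded to three decimal places.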
>>>
>>> ## How It Works
>>>
>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>> 2. Multiple cache layers hide S3 latency:
>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>    b. ZeroFS memory cache for metadata and hot data
>>>    c. Optional local disk cache
>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
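
To illustrate step 4, here is a minimal sketch of how a byte range could be mapped onto fixed 128 KB chunks. The function name and the (chunk index, start, end) tuple layout are hypothetical, and "128KB" is assumed to mean 128 × 1024 bytes; the actual ZeroFS key layout is not described in this post.

```python
CHUNK_SIZE = 128 * 1024  # assumption: "128KB" means 131072 bytes


def chunks_for_range(offset, length, chunk_size=CHUNK_SIZE):
    """Return (chunk_index, start_in_chunk, end_in_chunk) for each chunk touched."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    spans = []
    position = offset
    end = offset + length
    while position < end:
        index = position // chunk_size
        start_in_chunk = position % chunk_size
        # Stop at the end of this chunk or at the end of the request.
        end_in_chunk = min(chunk_size, start_in_chunk + (end - position))
        spans.append((index, start_in_chunk, end_in_chunk))
        position += end_in_chunk - start_in_chunk
    return spans
```

For example, an 8 KB write that starts 4 KB before a chunk boundary touches two chunks: the last 4 KB of one and the first 4 KB of the next.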
>>>
>>> ## Geo-Distributed PostgreSQL
>>>
>>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>
>>> Example architectures:
>>>
>>> Architecture 1:
>>>
>>>
>>>                          PostgreSQL Client
>>>                                    |
>>>                                    | SQL queries
>>>                                    |
>>>                             +--------------+
>>>                             |  PG Proxy    |
>>>                             | (HAProxy/    |
>>>                             |  PgBouncer)  |
>>>                             +--------------+
>>>                                /        \
>>>                               /          \
>>>                    Synchronous            Synchronous
>>>                    Replication            Replication
>>>                             /              \
>>>                            /                \
>>>               +---------------+        +---------------+
>>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>               | (Primary)     |◄------►| (Standby)     |
>>>               +---------------+        +---------------+
>>>                       |                        |
>>>                       |  POSIX filesystem ops  |
>>>                       |                        |
>>>               +---------------+        +---------------+
>>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>               | (3-way mirror)|        | (3-way mirror)|
>>>               +---------------+        +---------------+
>>>                /      |      \          /      |      \
>>>               /       |       \        /       |       \
>>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>>              |        |        |           |        |        |
>>>         +--------++--------++--------++--------++--------++--------+
>>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>         +--------++--------++--------++--------++--------++--------+
>>>              |         |         |         |         |         |
>>>              |         |         |         |         |         |
>>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
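
For intuition about the 3-way mirrors in Architecture 1: a mirrored pool stays readable as long as at least one of its devices is reachable. Assuming, purely for illustration, that each region fails independently and has the same availability a, the pool availability with n copies is 1 - (1 - a)^n. The 99.9% figure below is an assumption, not an S3 SLA or a measurement, and real failures are often correlated, so the result should be read as an optimistic bound.

```python
def mirror_availability(per_device_availability, copies):
    """Probability that at least one of `copies` independent mirrors is up."""
    if not 0.0 <= per_device_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - per_device_availability) ** copies


ASSUMED_REGION_AVAILABILITY = 0.999  # hypothetical, for illustration only

for copies in (1, 2, 3):
    pool = mirror_availability(ASSUMED_REGION_AVAILABILITY, copies)
    print(f"{copies}-way mirror: {pool:.9f}")
```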
>>>
>>> Architecture 2:
>>>
>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>                 \                    /
>>>                  \                  /
>>>                   Same ZFS Pool (NBD)
>>>                          |
>>>                   6 Global ZeroFS
>>>                          |
>>>                       S3 Regions
>>>
>>>
>>> The main advantages I see are:
>>> 1. Dramatic cost reduction for large datasets
>>> 2. Simplified geo-distribution
>>> 3. Effectively unlimited storage capacity
>>> 4. Built-in encryption and compression
>>>
>>> Looking forward to your feedback and questions!
>>>
>>> Best,
>>> Pierre
>>>
>>> P.S. The full project includes a custom NFS filesystem too.
>>>
>



--d5aa1bab362c4f96945cfcf1af96a23b--