Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ufaXS-009Fu6-Rq for pgsql-general@arkaria.postgresql.org; Sat, 26 Jul 2025 08:45:56 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ufaXR-00843u-Mh for pgsql-general@arkaria.postgresql.org; Sat, 26 Jul 2025 08:45:53 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ufaXQ-00843m-OI for pgsql-general@lists.postgresql.org; Sat, 26 Jul 2025 08:45:53 +0000 Received: from fhigh-b4-smtp.messagingengine.com ([202.12.124.155]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1ufaXM-000vdF-11 for pgsql-general@lists.postgresql.org; Sat, 26 Jul 2025 08:45:52 +0000 Received: from phl-compute-01.internal (phl-compute-01.phl.internal [10.202.2.41]) by mailfhigh.stl.internal (Postfix) with ESMTP id B08F77A0BAE; Sat, 26 Jul 2025 04:45:45 -0400 (EDT) Received: from phl-imap-04 ([10.202.2.82]) by phl-compute-01.internal (MEProxy); Sat, 26 Jul 2025 04:45:45 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=barre.sh; h=cc :cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm1; t=1753519545; x=1753605945; bh=pSC9jAD5JN mOfNPcveDUrT0fXBFYsSRUG87ryqiM21Y=; b=fQssTF+QHDu92/5/LbWHv6MCtl jipJGV2bgHsdnDW1MXIGPNu+Ko5jAIuR9RvS/HaZyvq6wbgnSmIc+g/0CFQARbJo QBevQtFWuHuCXiSyKh5c+kaeu64k8CqtouTxdHZvmhEc71OjTk+m90eKSRtkCvPG XGhiImrMC2CqQlUet2j9L5P5ztjX28iUpXwBwsErw9M8J0UzDLgvMk8x1/3lXG+c ex7gtsRYtTIPgnj/kSA8m/4Kup6pAaYifk/Ive98AwUxUpqm7VV5Ny0a7Jtz1+ci ns021XJF0a7MJPpBVtaQBGylTISOAsp21sCjlwGT4NZWSSof61wx3gZsyK6A== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t= 1753519545; x=1753605945; bh=pSC9jAD5JNmOfNPcveDUrT0fXBFYsSRUG87 ryqiM21Y=; b=Ib5CzNrB6PlNQAGgHmJgw7nuaPmlriChSu6NJ7wKXnW6op6aYfV x50AGL+AnG9vilzoMmDG5I8ai20DDfHqNTs34ef/eugEL5CvT3M/++XQAjhwHeTh heYup2YQZVJ9PUbgTAdeBUw/bPoyfOsZuDhYENOGPj7Tgk1RQ2sIfv1tW8mAuh6T ACL2zWrtA/YIxhGWxqBXSq4ZHJRruxm/c75Oy9UDEV1YhcWgPESg8pGkwSZoqVMa u4Ge0Dvv4mpOLxi7L5re2gnDLfgl1LXI7MO4zDtKSHQBz4yEkC3omeOYC8sd5Zh7 A6X3Fr7q8KR8simCo67ACCeWcawVUdmrheQ== X-ME-Sender: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgdekheeljecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpefoggffhffvvefkjghfufgtsegrtderreertdejnecuhfhrohhmpedfrfhivghrrhgv uceurghrrhgvfdcuoehpihgvrhhrvgessggrrhhrvgdrshhhqeenucggtffrrghtthgvrh hnpeejgefgiedthefgudevleejueehvdejgeekueekudekteduuefftdfhudeihedvhfen ucffohhmrghinhepnhgvohhnrdgtohhmpdgrmhgriihonhdrtghomhdpmhgvrhhklhgvmh grphdrtghomhdptggrlhgtuhhlrghtohhrrdgrfihspdhgihhthhhusgdrtghomhenucev lhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpehpihgvrhhrvg essggrrhhrvgdrshhhpdhnsggprhgtphhtthhopedvpdhmohguvgepshhmthhpohhuthdp rhgtphhtthhopehvlhgrughimhhirhestghhuhhrhihukhhinhdrtghomhdprhgtphhtth hopehpghhsqhhlqdhgvghnvghrrghlsehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhr gh X-ME-Proxy: Feedback-ID: i97614980:Fastmail Received: by mailuser.phl.internal (Postfix, from userid 501) id EE988B6006B; Sat, 26 Jul 2025 04:45:44 -0400 (EDT) X-Mailer: MessagingEngine.com Webmail Interface MIME-Version: 1.0 X-ThreadId: T89c86ea8eb4c36ce Date: Sat, 26 Jul 2025 10:44:41 +0200 From: "Pierre Barre" To: "Vladimir Churyukin" Cc: pgsql-general@lists.postgresql.org Message-Id: <8f750558-17b5-4d87-a03f-0dcfdbf4899c@app.fastmail.com> In-Reply-To: <44dafe90-9ad6-41ae-b9fe-bea4aaf49a59@app.fastmail.com> References: <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com> <96edd171-9cbe-466d-b3d6-04e069cee419@app.fastmail.com> <44dafe90-9ad6-41ae-b9fe-bea4aaf49a59@app.fastmail.com> Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance Content-Type: multipart/alternative; boundary=4b20097cb77a440d9fe2b0f0882b7a21 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --4b20097cb77a440d9fe2b0f0882b7a21 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Also, Neon [0] and Aurora [1] pricing is so high that it seems to make m= ost use-cases impractical (well, if you want a managed offering...). Neo= n's top public tier is not even what a single modern dedicated server (o= r virtual machine) can provide. I would have thought decoupling compute = and storage would make the offerings cheaper, if anything. Taking my own Merklemap [2] use-case where I run a 30TB database with Ne= on pricing (and I don't doubt that the non-public pricing would be even = more expensive than that): Storage Scaling: - Business plan: 500 GB -> $700 - You need: 30,000 GB (30 TB) - Scaling factor: 60x - Linear estimate: $700 =C3=97 60 =3D $42,000/month - Total 12 months cost: $504,000 Aurora calculation [3]: - Instance type: db.r5.24xlarge - Monthly cost: $21,887.28 - Total 12 months cost: $262,647.36 Now, calculating the same 30TB with the same instance type and S3 storag= e [4]: - Instance Type: r5.24xlarge - Monthly cost: $5,555.04 - Total 12 months cost: $66,660.48 But more interestingly, you don't need to use AWS at all anymore, becaus= e you can just move your setup anywhere at this point, as you get a simi= lar level of reliability - and simplicity - but with very cheap services. Hetzner ccx63 + Cloudflare R2: - Hetzner ccx63: =E2=82=AC287.99/month =E2=89=88 $338/month - R2 storage (30TB): 30,000 GB =C3=97 $0.015 =3D $450/month - R2 operations: Should be measured to be calculated properly, but will = probably be negligible. - Total monthly: ~$760 - Total 12 months cost: $9,120/year Best, Pierre [0] https://neon.com/pricing [1] https://aws.amazon.com/rds/aurora/pricing/ [2] https://www.merklemap.com/ [3] https://calculator.aws/#/estimate?id=3D3f0ce6a91eed9a666d54bb8852ea0= 0b042c3cd6e [4] https://calculator.aws/#/estimate?id=3D1a77d8da3489bafc8681c6fd738a3= 186fb749ea3 On Sat, Jul 26, 2025, at 09:51, Pierre Barre wrote: > Ah, by "shared storage" I mean that each node can acquire exclusivity,= not that they can both R/W to it at the same time. >=20 > > Some pretty well-known cases of storage / compute separation (Aurora= , Neon) also share the storage between instances, >=20 > That model is cool, but I think it's more of a solution for outliers a= s I was suggesting, not something that most would or should want. >=20 > Best, > Pierre >=20 > On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote: >> Sorry, I was referring to this: >>=20 >> > But when PostgreSQL instances share storage rather than replicate: >> > - Consistency seems maintained (same data) >> > - Availability seems maintained (client can always promote an acces= sible node) >> > - Partitions between PostgreSQL nodes don't prevent the system from= functioning >>=20 >> Some pretty well-known cases of storage / compute separation (Aurora,= Neon) also share the storage between instances, >> that's why I'm a bit confused by your reply. I thought you're thinkin= g about this approach too, that's why I mentioned what kind of challenge= s one may have on that path. >>=20 >>=20 >> On Sat, Jul 26, 2025 at 12:36=E2=80=AFAM Pierre Barre wrote: >>> __ >>> What you describe doesn=E2=80=99t look like something very useful fo= r the vast majority of projects that needs a database. Why would you eve= n want that if you can avoid it?=20 >>>=20 >>> If your =E2=80=9Csingle node=E2=80=9D can handle tens / hundreds of = thousands requests per second, still have very durable and highly availa= ble storage, as well as fast recovery mechanisms, what=E2=80=99s the poi= nt? >>>=20 >>> I am not trying to cater to extreme outliers that may want very weir= d like this, that=E2=80=99s just not the use-cases I want to address, be= cause I believe they are few and far between. >>>=20 >>> Best, >>> Pierre=20 >>>=20 >>> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote: >>>> A shared storage would require a lot of extra work. That's essentia= lly what AWS Aurora does. >>>> You will have to have functionality to sync in-memory states betwee= n nodes, because all the instances will have cached data that can easily= become stale on any write operation. >>>> That alone is not that simple. You will have to modify some locking= logic. Most likely do a lot of other changes in a lot of places, Postgr= es was not just built with the assumption that the storage can be shared. >>>>=20 >>>> -Vladimir >>>>=20 >>>> On Fri, Jul 18, 2025 at 5:31=E2=80=AFAM Pierre Barre wrote: >>>>> Now, I'm trying to understand how CAP theorem applies here. Tradit= ional PostgreSQL replication has clear CAP trade-offs - you choose betwe= en consistency and availability during partitions. >>>>>=20 >>>>> But when PostgreSQL instances share storage rather than replicate: >>>>> - Consistency seems maintained (same data) >>>>> - Availability seems maintained (client can always promote an acce= ssible node) >>>>> - Partitions between PostgreSQL nodes don't prevent the system fro= m functioning >>>>>=20 >>>>> It seems that CAP assumes specific implementation details (like no= des maintaining independent state) without explicitly stating them. >>>>>=20 >>>>> How should we think about CAP theorem when distributed nodes share= storage rather than coordinate state? Are the trade-offs simply moved t= o a different layer, or does shared storage fundamentally change the ana= lysis? >>>>>=20 >>>>> Client with awareness of both PostgreSQL nodes >>>>> | | >>>>> =E2=86=93 (partition here) =E2=86=93 >>>>> PostgreSQL Primary PostgreSQL Standby >>>>> | | >>>>> =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=98 >>>>> =E2=86=93 >>>>> Shared ZFS Pool >>>>> | >>>>> 6 Global ZeroFS instances >>>>>=20 >>>>> Best, >>>>> Pierre >>>>>=20 >>>>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote: >>>>> > Hi Seref, >>>>> > >>>>> > For the benchmarks, I used Hetzner's cloud service with the foll= owing setup: >>>>> > >>>>> > - A Hetzner s3 bucket in the FSN1 region >>>>> > - A virtual machine of type ccx63 48 vCPU 192 GB memory >>>>> > - 3 ZeroFS nbd devices (same s3 bucket) >>>>> > - A ZFS stripped pool with the 3 devices >>>>> > - 200GB zfs L2ARC >>>>> > - Postgres configured accordingly memory-wise as well as with sy= nchronous_commit =3D off, wal_init_zero =3D off and wal_recycle =3D off. >>>>> > >>>>> > Best, >>>>> > Pierre >>>>> > >>>>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote: >>>>> >> Sorry, this was meant to go to the whole group: >>>>> >> >>>>> >> Very interesting!. Great work. Can you clarify how exactly you'= re running postgres in your tests? A specific AWS service? What's the te= st infrastructure that sits above the file system? >>>>> >> >>>>> >> On Thu, Jul 17, 2025 at 11:59=E2=80=AFPM Pierre Barre wrote: >>>>> >>> Hi everyone, >>>>> >>> >>>>> >>> I wanted to share a project I've been working on that enables = PostgreSQL to run on S3 storage while maintaining performance comparable= to local NVMe. The approach uses block-level access rather than trying = to map filesystem operations to S3 objects. >>>>> >>> >>>>> >>> ZeroFS: https://github.com/Barre/ZeroFS >>>>> >>> >>>>> >>> # The Architecture >>>>> >>> >>>>> >>> ZeroFS provides NBD (Network Block Device) servers that expose= S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pool= s built on these block devices: >>>>> >>> >>>>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3 >>>>> >>> >>>>> >>> By providing block-level access and leveraging ZFS's caching c= apabilities (L2ARC), we can achieve microsecond latencies despite the un= derlying storage being in S3. >>>>> >>> >>>>> >>> ## Performance Results >>>>> >>> >>>>> >>> Here are pgbench results from PostgreSQL running on this setup: >>>>> >>> >>>>> >>> ### Read/Write Workload >>>>> >>> >>>>> >>> ``` >>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 1000= 00 example >>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1)) >>>>> >>> starting vacuum...end. >>>>> >>> transaction type: >>>>> >>> scaling factor: 50 >>>>> >>> query mode: simple >>>>> >>> number of clients: 50 >>>>> >>> number of threads: 15 >>>>> >>> maximum number of tries: 1 >>>>> >>> number of transactions per client: 100000 >>>>> >>> number of transactions actually processed: 5000000/5000000 >>>>> >>> number of failed transactions: 0 (0.000%) >>>>> >>> latency average =3D 0.943 ms >>>>> >>> initial connection time =3D 48.043 ms >>>>> >>> tps =3D 53041.006947 (without initial connection time) >>>>> >>> ``` >>>>> >>> >>>>> >>> ### Read-Only Workload >>>>> >>> >>>>> >>> ``` >>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 1000= 00 -S example >>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1)) >>>>> >>> starting vacuum...end. >>>>> >>> transaction type: >>>>> >>> scaling factor: 50 >>>>> >>> query mode: simple >>>>> >>> number of clients: 50 >>>>> >>> number of threads: 15 >>>>> >>> maximum number of tries: 1 >>>>> >>> number of transactions per client: 100000 >>>>> >>> number of transactions actually processed: 5000000/5000000 >>>>> >>> number of failed transactions: 0 (0.000%) >>>>> >>> latency average =3D 0.121 ms >>>>> >>> initial connection time =3D 53.358 ms >>>>> >>> tps =3D 413436.248089 (without initial connection time) >>>>> >>> ``` >>>>> >>> >>>>> >>> These numbers are with 50 concurrent clients and the actual da= ta stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory c= aches, while cold data comes from S3. >>>>> >>> >>>>> >>> ## How It Works >>>>> >>> >>>>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQ= L/ZFS can use like any other block device >>>>> >>> 2. Multiple cache layers hide S3 latency: >>>>> >>> a. ZFS ARC/L2ARC for frequently accessed blocks >>>>> >>> b. ZeroFS memory cache for metadata and hot dataZeroFS expo= ses NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any o= ther block device >>>>> >>> c. Optional local disk cache >>>>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3 >>>>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS= ' LSM-tree >>>>> >>> >>>>> >>> ## Geo-Distributed PostgreSQL >>>>> >>> >>>>> >>> Since each region can run its own ZeroFS instance, you can cre= ate geographically distributed PostgreSQL setups. >>>>> >>> >>>>> >>> Example architectures: >>>>> >>> >>>>> >>> Architecture 1 >>>>> >>> >>>>> >>> >>>>> >>> PostgreSQL Client >>>>> >>> | >>>>> >>> | SQL queries >>>>> >>> | >>>>> >>> +--------------+ >>>>> >>> | PG Proxy | >>>>> >>> | (HAProxy/ | >>>>> >>> | PgBouncer) | >>>>> >>> +--------------+ >>>>> >>> / \ >>>>> >>> / \ >>>>> >>> Synchronous Synchronous >>>>> >>> Replication Replication >>>>> >>> / \ >>>>> >>> / \ >>>>> >>> +---------------+ +---------------+ >>>>> >>> | PostgreSQL 1 | | PostgreSQL 2 | >>>>> >>> | (Primary) |=E2=97=84------=E2=96=BA| (Stan= dby) | >>>>> >>> +---------------+ +---------------+ >>>>> >>> | | >>>>> >>> | POSIX filesystem ops | >>>>> >>> | | >>>>> >>> +---------------+ +---------------+ >>>>> >>> | ZFS Pool 1 | | ZFS Pool 2 | >>>>> >>> | (3-way mirror)| | (3-way mirror)| >>>>> >>> +---------------+ +---------------+ >>>>> >>> / | \ / | \ >>>>> >>> / | \ / | \ >>>>> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD= :10814 >>>>> >>> | | | | | | >>>>> >>> +--------++--------++--------++--------++--------++---= -----+ >>>>> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||Zer= oFS 6| >>>>> >>> +--------++--------++--------++--------++--------++---= -----+ >>>>> >>> | | | | | = | >>>>> >>> | | | | | = | >>>>> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5= S3-Region6 >>>>> >>> (us-east) (eu-west) (ap-south) (us-west) (eu-north) (a= p-east) >>>>> >>> >>>>> >>> Architecture 2: >>>>> >>> >>>>> >>> PostgreSQL Primary (Region 1) =E2=86=90=E2=86=92 PostgreSQL St= andby (Region 2) >>>>> >>> \ / >>>>> >>> \ / >>>>> >>> Same ZFS Pool (NBD) >>>>> >>> | >>>>> >>> 6 Global ZeroFS >>>>> >>> | >>>>> >>> S3 Regions >>>>> >>> >>>>> >>> >>>>> >>> The main advantages I see are: >>>>> >>> 1. Dramatic cost reduction for large datasets >>>>> >>> 2. Simplified geo-distribution >>>>> >>> 3. Infinite storage capacity >>>>> >>> 4. Built-in encryption and compression >>>>> >>> >>>>> >>> Looking forward to your feedback and questions! >>>>> >>> >>>>> >>> Best, >>>>> >>> Pierre >>>>> >>> >>>>> >>> P.S. The full project includes a custom NFS filesystem too. >>>>> >>> >>>>> > >>>>>=20 >>>=20 >=20 --4b20097cb77a440d9fe2b0f0882b7a21 Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable
Also, Neon [= 0] and Aurora [1] pricing is so high that it seems to make most use-case= s impractical (well, if you want a managed offering...). Neon's top publ= ic tier is not even what a single modern dedicated server (or virtual ma= chine) can provide. I would have thought decoupling compute and storage = would make the offerings cheaper, if anything.

= Taking my own Merklemap [2] use-case where I run a 30TB database with Ne= on pricing (and I don't doubt that the non-public pricing would be even = more expensive than that):

Storage Scaling:

- Business plan: 500 GB -> $700
- You= need: 30,000 GB (30 TB)
- Scaling factor: 60x
- Lin= ear estimate: $700 =C3=97 60 =3D $42,000/month
- Total 12 mont= hs cost: $504,000

Aurora calculation [3]:
=

- Instance type: db.r5.24xlarge
- Monthly = cost: $21,887.28
- Total 12 months cost: $262,647.36

Now, calculating the same 30TB with the same instance ty= pe and S3 storage [4]:

- Instance Type: r5.24xl= arge
- Monthly cost: $5,555.04
- Total 12 months cos= t: $66,660.48

But more interestingly, you don't= need to use AWS at all anymore, because you can just move your setup an= ywhere at this point, as you get a similar level of reliability - and si= mplicity - but with very cheap services.

Hetzne= r ccx63 + Cloudflare R2:

- Hetzner ccx63: =E2=82= =AC287.99/month =E2=89=88 $338/month
- R2 storage (30TB): 30,0= 00 GB =C3=97 $0.015 =3D $450/month
- R2 operations: Should be = measured to be calculated properly, but will probably be negligible.
- Total monthly: ~$760
- Total 12 months cost: $9,120/y= ear

Best,
Pierre

=

On Sat, Jul 26, 2025, at 09:51, Pierre Barre w= rote:
Ah, by "s= hared storage" I mean that each node can acquire exclusivity, not that t= hey can both R/W to it at the same time.

>&n= bsp;Some pretty well-known cases of storage / compute separation (Aurora= , Neon) also share the storage between instances,

That model is cool, but I think it's more of a solution for outliers = as I was suggesting, not something that most would or should want.
=

Best,
Pierre

On S= at, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
Sorry, I was = referring to this:

>  But when PostgreSQL instances share stora= ge rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems mainta= ined (client can always promote an accessible node)
> - Partitions between PostgreSQL = nodes don't prevent the system from functioning
<= span style=3D"color:rgb(80, 0, 80);">
Some pretty w= ell-known cases of storage / compute separation (Aurora, Neon) also shar= e the storage between instances,
that's why I'm a bit confused= by your reply. I thought you're thinking about this approach too, that'= s why I mentioned what kind of challenges one may have on that path.
<= br>

On Sat, J= ul 26, 2025 at 12:36=E2=80=AFAM Pierre Barre <pierre@barre.sh> wrote:

What you describe doesn=E2=80=99t look like something very us= eful for the vast majority of projects that needs a database. Why would = you even want that if you can avoid it? 

I= f your =E2=80=9Csingle node=E2=80=9D can handle tens / hundreds of thous= ands requests per second, still have very durable and highly available s= torage, as well as fast recovery mechanisms, what=E2=80=99s the point?

I am not trying to cater to extreme outliers tha= t may want very weird like this, that=E2=80=99s just not the use-cases I= want to address, because I believe they are few and far between.
<= div>
Best,
Pierre 

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
=
A shared storage would require a lot of extra work. That's essentia= lly what AWS Aurora does.
You will have to have functionality = to sync in-memory states between nodes, because all the instances will h= ave cached data that can easily become stale on any write operation.
That alone is not that simple. You will have to modify some locki= ng logic. Most likely do a lot of other changes in a lot of places, Post= gres was not just built with the assumption that the storage can be shar= ed.

-Vladimir

On Fri, Jul 18, 2025 at 5:31=E2=80=AFAM Pierre Barre <= pierre@barre.sh= > wrote:
Now= , I'm trying to understand how CAP theorem applies here. Traditional Pos= tgreSQL replication has clear CAP trade-offs - you choose between consis= tency and availability during partitions.

But w= hen PostgreSQL instances share storage rather than replicate:
= - Consistency seems maintained (same data)
- Availability seem= s maintained (client can always promote an accessible node)
- = Partitions between PostgreSQL nodes don't prevent the system from functi= oning

It seems that CAP assumes specific implem= entation details (like nodes maintaining independent state) without expl= icitly stating them.

How should we think about = CAP theorem when distributed nodes share storage rather than coordinate = state? Are the trade-offs simply moved to a different layer, or does sha= red storage fundamentally change the analysis?

= Client with awareness of both PostgreSQL nodes
    |=                     &n= bsp;          |
    =E2=86=93= (partition here)              =E2=86= =93
PostgreSQL Primary           = ;   PostgreSQL Standby
    |     = ;                     =      |
    =E2=94=94=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=AC=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=98
  =               =E2=86=93
&nb= sp;        Shared ZFS Pool
    &= nbsp;           |
     = ;    6 Global ZeroFS instances

Best,<= /div>
Pierre

On Fri, Jul 18, 2025, at 12:57= , Pierre Barre wrote:
> Hi Seref,
>
= > For the benchmarks, I used Hetzner's cloud service with the followi= ng setup:
>
> - A Hetzner s3 bucket in the FSN= 1 region
> - A virtual machine of type ccx63 48 vCPU 192 GB= memory
> - 3 ZeroFS nbd devices (same s3 bucket)
> - A ZFS stripped pool with the 3 devices
> - 200GB zf= s L2ARC
> - Postgres configured accordingly memory-wise as = well as with synchronous_commit =3D off, wal_init_zero =3D off and wal_r= ecycle =3D off.
>
> Best,
> Pierr= e
>
> On Fri, Jul 18, 2025, at 12:42, Seref Ar= ikan wrote:
>> Sorry, this was meant to go to the whole = group:
>>
>> Very interesting!. Great wo= rk. Can you clarify how exactly you're running postgres in your tests? A= specific AWS service? What's the test infrastructure that sits above th= e file system?
>>
>> On Thu, Jul 17, 202= 5 at 11:59=E2=80=AFPM Pierre Barre <pierre@barre.sh> wrote:
>>>= ; Hi everyone,
>>>
>>> I wanted to= share a project I've been working on that enables PostgreSQL to run on = S3 storage while maintaining performance comparable to local NVMe. The a= pproach uses block-level access rather than trying to map filesystem ope= rations to S3 objects.
>>>
>>>=
>>> # The Architecture
>>>
<= div>>>> ZeroFS provides NBD (Network Block Device) servers that= expose S3 storage as raw block devices. PostgreSQL runs unmodified on Z= FS pools built on these block devices:
>>>
= >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
<= div>>>>
>>> By providing block-level access = and leveraging ZFS's caching capabilities (L2ARC), we can achieve micros= econd latencies despite the underlying storage being in S3.
&g= t;>>
>>> ## Performance Results
>&= gt;>
>>> Here are pgbench results from PostgreSQL = running on this setup:
>>>
>>> ###= Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -= j 15 -t 100000 example
>>> pgbench (16.9 (Ubuntu 16.9= -0ubuntu0.24.04.1))
>>> starting vacuum...end.
<= div>>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode:= simple
>>> number of clients: 50
>>&= gt; number of threads: 15
>>> maximum number of tries= : 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5= 000000
>>> number of failed transactions: 0 (0.000%)<= /div>
>>> latency average =3D 0.943 ms
>>&g= t; initial connection time =3D 48.043 ms
>>> tps =3D = 53041.006947 (without initial connection time)
>>> ``= `
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> = postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S exam= ple
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))=
>>> starting vacuum...end.
>>> tr= ansaction type: <builtin: select only>
>>> scal= ing factor: 50
>>> query mode: simple
>&= gt;> number of clients: 50
>>> number of threads: = 15
>>> maximum number of tries: 1
>>&= gt; number of transactions per client: 100000
>>> num= ber of transactions actually processed: 5000000/5000000
>&g= t;> number of failed transactions: 0 (0.000%)
>>> = latency average =3D 0.121 ms
>>> initial connection t= ime =3D 53.358 ms
>>> tps =3D 413436.248089 (without = initial connection time)
>>> ```
>>&g= t;
>>> These numbers are with 50 concurrent clients a= nd the actual data stored in S3. Hot data is served from ZFS L2ARC and Z= eroFS's memory caches, while cold data comes from S3.
>>= >
>>> ## How It Works
>>>
=
>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that P= ostgreSQL/ZFS can use like any other block device
>>>= 2. Multiple cache layers hide S3 latency:
>>>  =   a. ZFS ARC/L2ARC for frequently accessed blocks
>>= ;>    b. ZeroFS memory cache for metadata and hot dataZeroF= S exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like= any other block device
>>>    c. Optional = local disk cache
>>> 3. All data is encrypted (ChaCha= 20-Poly1305) before hitting S3
>>> 4. Files are split= into 128KB chunks for insertion into ZeroFS' LSM-tree
>>= ;>
>>> ## Geo-Distributed PostgreSQL
>= ;>>
>>> Since each region can run its own ZeroF= S instance, you can create geographically distributed PostgreSQL setups.=
>>>
>>> Example architectures:
>>>
>>> Architecture 1
>= ;>>
>>>
>>>    &nb= sp;                    = ; PostgreSQL Client
>>>        &n= bsp;                   &nbs= p;       |
>>>      &nb= sp;                    = ;         | SQL queries
>>> =                     &= nbsp;             |
>>>=                     &n= bsp;        +--------------+
>>>&= nbsp;                   &nb= sp;        |  PG Proxy    |
>>>                &nb= sp;            | (HAProxy/    |<= /div>
>>>              &= nbsp;              |  PgBouncer)=   |
>>>           =                  +--------= ------+
>>>            =                     /&= nbsp;       \
>>>      =                     &n= bsp;    /          \
>&g= t;>                  &nb= sp; Synchronous            Synchronous
>>>              &nbs= p;     Replication            Re= plication
>>>           = ;                  /  =             \
>>> =                     &= nbsp;     /              &n= bsp; \
>>>            &= nbsp;  +---------------+        +--------------= -+
>>>             = ;  | PostgreSQL 1  |        | PostgreSQL 2=   |
>>>           =    | (Primary)     |=E2=97=84------=E2=96=BA| = (Standby)     |
>>>     = ;          +---------------+     = ;   +---------------+
>>>      &n= bsp;                |  &nbs= p;                    = |
>>>             = ;          |  POSIX filesystem ops  |=
>>>              =          |          &n= bsp;             |
>>>&= nbsp;              +---------------+&= nbsp;       +---------------+
>>> = ;              |   ZFS Pool= 1  |        |   ZFS Pool 2  |
>>>              &n= bsp;| (3-way mirror)|        | (3-way mirror)|
=
>>>               = +---------------+        +---------------+
>>>                /&n= bsp;     |      \        &n= bsp; /      |      \
>>>= ;               /    &= nbsp;  |       \        /&n= bsp;      |       \
>>= ;>         NBD:10809 NBD:10810 NBD:10811&nbs= p; NBD:10812 NBD:10813 NBD:10814
>>>    &nb= sp;         |        |  &nb= sp;     |           |  &nbs= p;     |        |
>>>&n= bsp;        +--------++--------++--------++--------+= +--------++--------+
>>>        &= nbsp;|ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
<= div>>>>         +--------++--------++-= -------++--------++--------++--------+
>>>  &nbs= p;           |         = ;|         |         |=          |         |
>>>              |&= nbsp;        |         |&nb= sp;        |         | = ;        |
>>>     = ;    S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3= -Region6
>>>         (us-eas= t) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>>= ;>
>>> Architecture 2:
>>>
>>> PostgreSQL Primary (Region 1) =E2=86=90=E2=86=92 Post= greSQL Standby (Region 2)
>>>      &nb= sp;          \         = ;           /
>>>  &nbs= p;               \     = ;             /
>>>&nbs= p;                  Same ZF= S Pool (NBD)
>>>          &n= bsp;               |
>&g= t;>                  &nb= sp;6 Global ZeroFS
>>>        &nb= sp;                 |
= >>>                &nbs= p;      S3 Regions
>>>
>&= gt;>
>>> The main advantages I see are:
= >>> 1. Dramatic cost reduction for large datasets
>= ;>> 2. Simplified geo-distribution
>>> 3. Infin= ite storage capacity
>>> 4. Built-in encryption and c= ompression
>>>
>>> Looking forward= to your feedback and questions!
>>>
>&g= t;> Best,
>>> Pierre
>>>
<= div>>>> P.S. The full project includes a custom NFS filesystem = too.
>>>
>




--4b20097cb77a440d9fe2b0f0882b7a21--