From: Vladimir Churyukin
Date: Sat, 26 Jul 2025 00:42:54 -0700
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
To: Pierre Barre
Cc: pgsql-general@lists.postgresql.org
In-Reply-To: <96edd171-9cbe-466d-b3d6-04e069cee419@app.fastmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0000000000008a5aa8063ad035b0"

--0000000000008a5aa8063ad035b0
Content-Type: text/plain; charset="UTF-8"

Sorry, I was referring to this:

> But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning

Some pretty well-known cases of storage/compute separation (Aurora, Neon) also share the storage between instances, which is why I'm a bit confused by your reply. I thought you were considering this approach too, so I mentioned the kinds of challenges one may face on that path.


On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pierre@barre.sh> wrote:
What you describe doesn’t look like something very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?

If your “single node” can handle tens or hundreds of thousands of requests per second while still having very durable and highly available storage, as well as fast recovery mechanisms, what’s the point?

I am not trying to cater to extreme outliers that may want something very unusual like this. Those are just not the use cases I want to address, because I believe they are few and far between.

Best,
Pierre

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
Shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
You will need functionality to sync in-memory state between nodes, because every instance will have cached data that can easily become stale on any write operation.
That alone is not simple. You will have to modify some locking logic, and most likely make many other changes in many places. Postgres was just not built with the assumption that the storage can be shared.
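
To make the stale-cache point concrete, here is a minimal sketch in Python. It is a toy model, not PostgreSQL code: `SharedStorage` and `Node` are hypothetical names, and the per-node dictionary only stands in for a buffer cache that receives no invalidations from other nodes.

```python
# Toy model: two database nodes with private page caches over one shared
# block store. All names here are hypothetical; this is not PostgreSQL code.


class SharedStorage:
    """A single block store visible to every node."""

    def __init__(self):
        self.blocks = {}

    def read(self, block_id):
        return self.blocks.get(block_id)

    def write(self, block_id, value):
        self.blocks[block_id] = value


class Node:
    """A node that caches blocks in memory and never hears about remote writes."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage
        self.cache = {}

    def read(self, block_id):
        # Serve from the private cache when possible, as a buffer pool would.
        if block_id not in self.cache:
            self.cache[block_id] = self.storage.read(block_id)
        return self.cache[block_id]

    def write(self, block_id, value):
        # Write through to shared storage and update only the local cache.
        self.storage.write(block_id, value)
        self.cache[block_id] = value


storage = SharedStorage()
node_a = Node("a", storage)
node_b = Node("b", storage)

node_a.write("page-1", "balance=100")
first_read = node_b.read("page-1")    # node_b caches the page
node_a.write("page-1", "balance=50")  # node_b is not notified
stale_read = node_b.read("page-1")    # served from node_b's stale cache

print(first_read, stale_read, storage.read("page-1"))
```

In this model the second read on `node_b` still returns the old value even though shared storage holds the new one. Avoiding that requires some form of cross-node cache invalidation, which is the extra work described above.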

-Vladimir

On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:
Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs: you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary              PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances
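
One way to probe this question is to model what happens if the client promotes the standby while the old primary is still running. The following is a minimal sketch under that assumption; the dictionary stands in for the shared pool and `read_modify_write` is a hypothetical helper, not anything from PostgreSQL, ZFS or ZeroFS.

```python
# Toy model of the diagram above: the two PostgreSQL nodes cannot see each
# other, but both still reach the shared pool. Names are hypothetical.

shared_pool = {"counter": 0}


def read_modify_write(local_view, pool, key, delta):
    """Apply an increment computed from a node's private snapshot of the pool."""
    pool[key] = local_view[key] + delta


# Both nodes snapshot the pool before the partition is noticed.
primary_view = dict(shared_pool)
standby_view = dict(shared_pool)

# The client promotes the standby while the old primary is still running,
# so both nodes accept writes and flush them to the same pool.
read_modify_write(primary_view, shared_pool, "counter", 10)
read_modify_write(standby_view, shared_pool, "counter", 5)

print(shared_pool["counter"])  # 5: the primary's increment of 10 was overwritten
```

In this model, keeping both nodes available during the partition costs a lost update. Preventing it means making sure only one node can write at a time, which is itself a form of coordination. That suggests the trade-off moves to the storage layer rather than disappearing.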

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
>
> For the benchmarks, I used Hetzner's cloud service with the following setup:
>
> - A Hetzner S3 bucket in the FSN1 region
> - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
> - 3 ZeroFS NBD devices (same S3 bucket)
> - A ZFS striped pool across the 3 devices
> - 200 GB ZFS L2ARC
> - Postgres configured accordingly memory-wise, with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>>
>> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
>>> Hi everyone,
>>>
>>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>
>>> ZeroFS: https://github.com/Barre/ZeroFS
>>>
>>> # The Architecture
>>>
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>
>>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies for cached data even though the underlying storage is in S3.
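
The effect of caching on average latency can be sketched with a simple weighted average. This is a one-layer model with illustrative inputs: the two latency constants below are assumptions chosen for the example, not measurements from this thread, and `effective_latency_us` is a hypothetical helper.

```python
def effective_latency_us(hit_ratio, cache_latency_us, backend_latency_us):
    """Expected read latency for a single cache layer in front of a slow backend."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_latency_us + (1.0 - hit_ratio) * backend_latency_us


# Illustrative inputs only, not measurements from the benchmark in this thread.
CACHE_LATENCY_US = 100.0       # assumed local cache read
BACKEND_LATENCY_US = 50_000.0  # assumed object-store round trip

for hit_ratio in (0.90, 0.99, 0.999):
    latency = effective_latency_us(hit_ratio, CACHE_LATENCY_US, BACKEND_LATENCY_US)
    print(f"hit ratio {hit_ratio:.3f}: {latency:.0f} us")
```

Under these assumed numbers the average stays dominated by the backend until the hit ratio gets very close to 1. The model therefore suggests that low average latency depends on the working set fitting in the cache layers.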
>>>
>>> ## Performance Results
>>>
>>> Here are pgbench results from PostgreSQL running on this setup:
>>>
>>> ### Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: select only>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>>
>>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
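
As a quick sanity check, the reported latency and throughput figures can be compared using Little's law (throughput is roughly the number of clients divided by latency when every client is always busy). The sketch below uses only the figures from the two pgbench runs above; `implied_tps` is a hypothetical helper name.

```python
def implied_tps(clients, latency_ms):
    """Throughput implied by Little's law when every client is always busy."""
    return clients / (latency_ms / 1000.0)


# Figures copied from the two pgbench runs above.
runs = {
    "read/write": {"clients": 50, "latency_ms": 0.943, "reported_tps": 53041.006947},
    "read-only": {"clients": 50, "latency_ms": 0.121, "reported_tps": 413436.248089},
}

for name, run in runs.items():
    estimate = implied_tps(run["clients"], run["latency_ms"])
    gap = abs(estimate - run["reported_tps"]) / run["reported_tps"]
    print(f"{name}: implied {estimate:.0f} tps, reported {run['reported_tps']:.0f} tps, gap {gap:.2%}")
```

Both runs agree with the reported tps to well under one percent, so the latency and throughput figures are consistent with each other. The small gap is plausibly down to rounding in the printed latency.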
>>>
>>> ## How It Works
>>>
>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>> 2. Multiple cache layers hide S3 latency:
>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>    b. ZeroFS memory cache for metadata and hot data
>>>    c. Optional local disk cache
>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
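
The chunking in step 4 can be illustrated with a short sketch. It shows only fixed-size splitting and offset arithmetic. How ZeroFS actually keys chunks in its LSM-tree is not described in this thread, so the `(chunk_index, chunk_bytes)` layout and both function names are assumptions made for illustration.

```python
CHUNK_SIZE = 128 * 1024  # 128 KiB, matching the chunk size stated above


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Return (chunk_index, chunk_bytes) pairs covering data in order."""
    return [
        (offset // chunk_size, data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def locate(byte_offset, chunk_size=CHUNK_SIZE):
    """Map a byte offset in a file to (chunk_index, offset_within_chunk)."""
    return divmod(byte_offset, chunk_size)


payload = bytes(300 * 1024)  # a 300 KiB file of zero bytes
chunks = split_into_chunks(payload)

print([(index, len(chunk)) for index, chunk in chunks])
print(locate(200 * 1024))
```

With fixed-size chunks, a read or write at a given offset touches only the chunks that overlap it. This is one reason block-style access can map onto an object store without rewriting whole files.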
>>>
>>> ## Geo-Distributed PostgreSQL
>>>
>>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>
>>> Example architectures:
>>>
>>> Architecture 1:
>>>
>>>                          PostgreSQL Client
>>>                                    |
>>>                                    | SQL queries
>>>                                    |
>>>                             +--------------+
>>>                             |  PG Proxy    |
>>>                             | (HAProxy/    |
>>>                             |  PgBouncer)  |
>>>                             +--------------+
>>>                                /        \
>>>                               /          \
>>>                    Synchronous            Synchronous
>>>                    Replication            Replication
>>>                             /              \
>>>                            /                \
>>>               +---------------+        +---------------+
>>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>               | (Primary)     |◄------►| (Standby)     |
>>>               +---------------+        +---------------+
>>>                       |                        |
>>>                       |  POSIX filesystem ops  |
>>>                       |                        |
>>>               +---------------+        +---------------+
>>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>               | (3-way mirror)|        | (3-way mirror)|
>>>               +---------------+        +---------------+
>>>                /      |      \          /      |      \
>>>               /       |       \        /       |       \
>>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>>              |        |        |           |        |        |
>>>         +--------++--------++--------++--------++--------++--------+
>>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>         +--------++--------++--------++--------++--------++--------+
>>>              |         |         |         |         |         |
>>>              |         |         |         |         |         |
>>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
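
The 3-way mirror in Architecture 1 can be sketched as a toy model in which each mirror side stands in for one NBD device backed by a different region. `MirrorDevice` and `Mirror` are hypothetical classes. Real ZFS mirrors also resilver devices that come back online, which this sketch deliberately ignores.

```python
class MirrorDevice:
    """One side of a mirror, standing in for an NBD device backed by one region."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}


class Mirror:
    """Writes go to every online device; reads are served by the first online one."""

    def __init__(self, devices):
        self.devices = devices

    def write(self, block_id, value):
        online = [device for device in self.devices if device.online]
        if not online:
            raise IOError("no mirror device is reachable")
        for device in online:
            device.blocks[block_id] = value

    def read(self, block_id):
        for device in self.devices:
            if device.online and block_id in device.blocks:
                return device.blocks[block_id]
        raise IOError("block is not readable from any online device")


pool = Mirror([MirrorDevice("us-east"), MirrorDevice("eu-west"), MirrorDevice("ap-south")])
pool.write("block-7", b"tuple data")

# Lose two of the three regions; the remaining copy still serves the read.
pool.devices[0].online = False
pool.devices[1].online = False
print(pool.read("block-7"))
```

In this model a pool keeps serving reads as long as one of its three regions is reachable. The price is that every write has to reach all online regions.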
>>>
>>> Architecture 2:
>>>
>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>                 \                    /
>>>                  \                  /
>>>                   Same ZFS Pool (NBD)
>>>                          |
>>>                   6 Global ZeroFS
>>>                          |
>>>                       S3 Regions
>>>
>>>
>>> The main advantages I see are:
>>> 1. Dramatic cost reduction for large datasets
>>> 2. Simplified geo-distribution
>>> 3. Effectively unlimited storage capacity
>>> 4. Built-in encryption and compression
>>>
>>> Looking forward to your feedback and questions!
>>>
>>> Best,
>>> Pierre
>>>
>>> P.S. The full project includes a custom NFS filesystem too.
>>>
>


--0000000000008a5aa8063ad035b0--