Date: Sat, 26 Jul 2025 09:36:07 +0200
From: "Pierre Barre"
To: "Vladimir Churyukin"
Cc: pgsql-general@lists.postgresql.org
Message-Id: <96edd171-9cbe-466d-b3d6-04e069cee419@app.fastmail.com>
References: <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com>
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

What you describe doesn’t look like something that would be very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?

If your “single node” can handle tens or hundreds of thousands of requests per second and still has very durable, highly available storage as well as fast recovery mechanisms, what’s the point?

I am not trying to cater to extreme outliers that may want something very unusual like this. Those are just not the use cases I want to address, because I believe they are few and far between.

Best,
Pierre

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
> Shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
> You will need functionality to sync in-memory state between nodes, because all the instances will have cached data that can easily become stale on any write operation.
> That alone is not that simple. You will have to modify some locking logic, and most likely make a lot of other changes in a lot of places. Postgres was just not built with the assumption that the storage can be shared.
>
> -Vladimir
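
To illustrate the stale-cache point above, here is a toy sketch that is not taken from the thread. `SharedStorage` and `Node` are made-up names, and the per-node dictionary only loosely stands in for a buffer cache; nothing here reflects PostgreSQL or Aurora internals.

```python
class SharedStorage:
    """Stand-in for a block device that both nodes can read and write."""

    def __init__(self):
        self.blocks = {}


class Node:
    """Toy database node with a private read cache that is never invalidated."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage
        self.cache = {}

    def read(self, key):
        # Serve from the local cache when possible; fall back to shared storage.
        if key not in self.cache:
            self.cache[key] = self.storage.blocks.get(key)
        return self.cache[key]

    def write(self, key, value):
        # The write reaches shared storage, but only this node's cache is updated.
        self.storage.blocks[key] = value
        self.cache[key] = value


storage = SharedStorage()
a, b = Node("a", storage), Node("b", storage)

a.write("balance", 100)
first = b.read("balance")   # b loads and caches the value written by a
a.write("balance", 50)
stale = b.read("balance")   # b answers from its own cache, not from storage
fresh = storage.blocks["balance"]

print(first, stale, fresh)
```

In this toy model the second read on node `b` returns the old value even though shared storage already holds the new one, which is the kind of cross-node invalidation a shared-storage design would have to add.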

> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:
>> Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.
>>
>> But when PostgreSQL instances share storage rather than replicate:
>> - Consistency seems maintained (same data)
>> - Availability seems maintained (client can always promote an accessible node)
>> - Partitions between PostgreSQL nodes don't prevent the system from functioning
>>
>> It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.
>>
>> How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?
>>
>> Client with awareness of both PostgreSQL nodes
>>     |                               |
>>     ↓ (partition here)              ↓
>> PostgreSQL Primary              PostgreSQL Standby
>>     |                               |
>>     └───────────┬───────────────────┘
>>                 ↓
>>          Shared ZFS Pool
>>                 |
>>          6 Global ZeroFS instances
>>
>> Best,
>> Pierre

>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>> > Hi Seref,
>> >
>> > For the benchmarks, I used Hetzner's cloud service with the following setup:
>> >
>> > - A Hetzner S3 bucket in the FSN1 region
>> > - A virtual machine of type ccx63 (48 vCPUs, 192 GB memory)
>> > - 3 ZeroFS NBD devices (same S3 bucket)
>> > - A ZFS striped pool with the 3 devices
>> > - 200GB ZFS L2ARC
>> > - Postgres configured accordingly memory-wise, with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>> >
>> > Best,
>> > Pierre
>> >
>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> >> Sorry, this was meant to go to the whole group:
>> >>
>> >> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>> >>
>> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
>> >>> Hi everyone,
>> >>>
>> >>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>> >>>
>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>> >>>
>> >>> # The Architecture
>> >>>
>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>> >>>
>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>> >>>
>> >>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond-scale latencies for cached data despite the underlying storage being in S3.
>> >>>
>> >>> ## Performance Results
>> >>>
>> >>> Here are pgbench results from PostgreSQL running on this setup:
>> >>>
>> >>> ### Read/Write Workload
>> >>>
>> >>> ```
>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> >>> starting vacuum...end.
>> >>> transaction type: <builtin: TPC-B (sort of)>
>> >>> scaling factor: 50
>> >>> query mode: simple
>> >>> number of clients: 50
>> >>> number of threads: 15
>> >>> maximum number of tries: 1
>> >>> number of transactions per client: 100000
>> >>> number of transactions actually processed: 5000000/5000000
>> >>> number of failed transactions: 0 (0.000%)
>> >>> latency average = 0.943 ms
>> >>> initial connection time = 48.043 ms
>> >>> tps = 53041.006947 (without initial connection time)
>> >>> ```
>> >>>
>> >>> ### Read-Only Workload
>> >>>
>> >>> ```
>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> >>> starting vacuum...end.
>> >>> transaction type: <builtin: select only>
>> >>> scaling factor: 50
>> >>> query mode: simple
>> >>> number of clients: 50
>> >>> number of threads: 15
>> >>> maximum number of tries: 1
>> >>> number of transactions per client: 100000
>> >>> number of transactions actually processed: 5000000/5000000
>> >>> number of failed transactions: 0 (0.000%)
>> >>> latency average = 0.121 ms
>> >>> initial connection time = 53.358 ms
>> >>> tps = 413436.248089 (without initial connection time)
>> >>> ```
>> >>>
>> >>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
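
As a quick sanity check on the quoted pgbench figures (a sketch added for illustration, not from the thread), throughput with a fixed number of busy clients should be roughly clients divided by average latency. The helper name `expected_tps` is made up here; the numbers are copied from the two runs above.

```python
def expected_tps(clients: int, avg_latency_ms: float) -> float:
    """Estimate throughput from concurrency and average latency (Little's law)."""
    return clients / (avg_latency_ms / 1000.0)


# (clients, reported latency in ms, reported tps) from the two pgbench runs.
runs = {
    "read/write": (50, 0.943, 53041.006947),
    "read-only": (50, 0.121, 413436.248089),
}

for name, (clients, latency_ms, reported_tps) in runs.items():
    estimate = expected_tps(clients, latency_ms)
    deviation = abs(estimate - reported_tps) / reported_tps
    print(f"{name}: estimated {estimate:.0f} tps, "
          f"reported {reported_tps:.0f} tps, deviation {deviation:.3%}")
```

Both estimates land within a fraction of a percent of the reported tps, so the latency and throughput numbers are internally consistent; the small gap is explained by the latency being rounded to three decimals.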
>> >>>
>> >>> ## How It Works
>> >>>
>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>> >>> 2. Multiple cache layers hide S3 latency:
>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>> >>>    b. ZeroFS memory cache for metadata and hot data
>> >>>    c. Optional local disk cache
>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
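
To make the chunking in step 4 concrete, here is a minimal sketch (not from the thread, and not ZeroFS code) of how a byte range maps onto fixed 128 KB chunks. The function name `chunks_for_range` and the tuple layout are assumptions for illustration; how ZeroFS actually keys chunks in its LSM-tree is not described in the post.

```python
CHUNK_SIZE = 128 * 1024  # 128 KB, the chunk size mentioned in the post


def chunks_for_range(offset: int, length: int):
    """Return (chunk_index, start_in_chunk, end_in_chunk) for each chunk touched."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    spans = []
    pos, end = offset, offset + length
    while pos < end:
        index = pos // CHUNK_SIZE
        start = pos % CHUNK_SIZE
        # Stop at the chunk boundary or at the end of the requested range.
        stop = min(CHUNK_SIZE, start + (end - pos))
        spans.append((index, start, stop))
        pos += stop - start
    return spans


# An 8 KB write that straddles the boundary between chunk 0 and chunk 1.
spans = chunks_for_range(offset=CHUNK_SIZE - 4096, length=8192)
print(spans)
```

A range that crosses a chunk boundary touches two chunks, so a single small write can turn into two chunk updates; that is one reason the cache layers in step 2 matter.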
>> >>>
>> >>> ## Geo-Distributed PostgreSQL
>> >>>
>> >>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>> >>>
>> >>> Example architectures:
>> >>>
>> >>> Architecture 1:
>> >>>
>> >>>                           PostgreSQL Client
>> >>>                                   |
>> >>>                                   | SQL queries
>> >>>                                   |
>> >>>                            +--------------+
>> >>>                            |  PG Proxy    |
>> >>>                            | (HAProxy/    |
>> >>>                            |  PgBouncer)  |
>> >>>                            +--------------+
>> >>>                               /        \
>> >>>                              /          \
>> >>>                     Synchronous            Synchronous
>> >>>                     Replication            Replication
>> >>>                           /              \
>> >>>                          /                \
>> >>>              +---------------+        +---------------+
>> >>>              | PostgreSQL 1  |        | PostgreSQL 2  |
>> >>>              | (Primary)     |◄------►| (Standby)     |
>> >>>              +---------------+        +---------------+
>> >>>                      |                        |
>> >>>                      |  POSIX filesystem ops  |
>> >>>                      |                        |
>> >>>              +---------------+        +---------------+
>> >>>              |   ZFS Pool 1  |        |   ZFS Pool 2  |
>> >>>              | (3-way mirror)|        | (3-way mirror)|
>> >>>              +---------------+        +---------------+
>> >>>               /      |      \          /      |      \
>> >>>              /       |       \        /       |       \
>> >>>          NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>> >>>               |         |         |         |         |         |
>> >>>          +--------++--------++--------++--------++--------++--------+
>> >>>          |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>> >>>          +--------++--------++--------++--------++--------++--------+
>> >>>               |         |         |         |         |         |
>> >>>               |         |         |         |         |         |
>> >>>          S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>> >>>          (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>> >>>
>> >>> Architecture 2:
>> >>>
>> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>> >>>                 \                    /
>> >>>                  \                  /
>> >>>                   Same ZFS Pool (NBD)
>> >>>                            |
>> >>>                     6 Global ZeroFS
>> >>>                            |
>> >>>                        S3 Regions
>> >>>
>> >>>
>> >>> The main advantages I see are:
>> >>> 1. Dramatic cost reduction for large datasets
>> >>> 2. Simplified geo-distribution
>> >>> 3. Effectively unlimited storage capacity
>> >>> 4. Built-in encryption and compression
>> >>>
>> >>> Looking forward to your feedback and questions!
>> >>>
>> >>> Best,
>> >>> Pierre
>> >>>
>> >>> P.S. The full project includes a custom NFS filesystem too.