Feedback-ID: i97614980:Fastmail
MIME-Version: 1.0
Date: Sat, 26 Jul 2025 09:36:07 +0200
From: "Pierre Barre" <pierre@barre.sh>
To: "Vladimir Churyukin" <vladimir@churyukin.com>
Cc: pgsql-general@lists.postgresql.org
Message-Id: <96edd171-9cbe-466d-b3d6-04e069cee419@app.fastmail.com>
In-Reply-To: 
 <CAFSGpE2j29C0MntAe59oay0sm1W_htQFyAbfuJeEViJ0BN1Wyg@mail.gmail.com>
References: <a9fe5ddb-9685-4139-bc1f-88161a7a4da3@app.fastmail.com>
 <CAG1bHGOzCNtDeW0W8gRO7mpW=t7BqWh-iz4kX5VRCPgt_6Tr6Q@mail.gmail.com>
 <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com>
 <c5a52444-80cd-4e50-8fc4-a3a9bc09feb4@app.fastmail.com>
 <CAFSGpE2j29C0MntAe59oay0sm1W_htQFyAbfuJeEViJ0BN1Wyg@mail.gmail.com>
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Content-Type: multipart/alternative;
 boundary=6c0ad1abfee34a82af0a664aaa60efcc
Archived-At: <https://www.postgresql.org/message-id/96edd171-9cbe-466d-b3d6-04e069cee419%40app.fastmail.com>
Precedence: bulk

--6c0ad1abfee34a82af0a664aaa60efcc
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

What you describe doesn=E2=80=99t look like something very useful for th=
e vast majority of projects that need a database. Why would you even wan=
t that if you can avoid it?=20

If your =E2=80=9Csingle node=E2=80=9D can handle tens or hundreds of tho=
usands of requests per second while still having very durable and highly=
 available storage, as well as fast recovery mechanisms, what=E2=80=99s =
the point?

I am not trying to cater to extreme outliers that may want something ver=
y weird like this. Those are just not the use cases I want to address, b=
ecause I believe they are few and far between.

Best,
Pierre=20

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
> A shared storage would require a lot of extra work. That's essentially=
 what AWS Aurora does.
> You will have to have functionality to sync in-memory states between n=
odes, because all the instances will have cached data that can easily be=
come stale on any write operation.
> That alone is not that simple. You will have to modify some locking lo=
gic and most likely make a lot of other changes in a lot of places; Post=
gres was just not built with the assumption that the storage can be shar=
ed.
>=20
> -Vladimir
>=20
> On Fri, Jul 18, 2025 at 5:31=E2=80=AFAM Pierre Barre <pierre@barre.sh>=
 wrote:
>> Now, I'm trying to understand how the CAP theorem applies here. Tradi=
tional PostgreSQL replication has clear CAP trade-offs - you choose betw=
een consistency and availability during partitions.
>>=20
>> But when PostgreSQL instances share storage rather than replicate:
>> - Consistency seems maintained (same data)
>> - Availability seems maintained (client can always promote an accessi=
ble node)
>> - Partitions between PostgreSQL nodes don't prevent the system from f=
unctioning
>>=20
>> It seems that CAP assumes specific implementation details (like nodes=
 maintaining independent state) without explicitly stating them.
>>=20
>> How should we think about the CAP theorem when distributed nodes shar=
e storage rather than coordinate state? Are the trade-offs simply moved =
to a different layer, or does shared storage fundamentally change the an=
alysis?
>>=20
>> Client with awareness of both PostgreSQL nodes
>>     |                               |
>>     =E2=86=93 (partition here)              =E2=86=93
>> PostgreSQL Primary              PostgreSQL Standby
>>     |                               |
>>     =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=
=94=80=E2=94=98
>>                 =E2=86=93
>>          Shared ZFS Pool
>>                 |
>>          6 Global ZeroFS instances
>>=20
>> Best,
>> Pierre
>>=20
>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>> > Hi Seref,
>> >=20
>> > For the benchmarks, I used Hetzner's cloud service with the followi=
ng setup:
>> >=20
>> > - A Hetzner S3 bucket in the FSN1 region
>> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>> > - 3 ZeroFS NBD devices (same S3 bucket)
>> > - A ZFS striped pool with the 3 devices
>> > - 200GB ZFS L2ARC
>> > - Postgres configured accordingly memory-wise as well as with synch=
ronous_commit =3D off, wal_init_zero =3D off and wal_recycle =3D off.
>> >=20
>> > Best,
>> > Pierre
>> >=20
>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> >> Sorry, this was meant to go to the whole group:
>> >>=20
>> >> Very interesting! Great work. Can you clarify how exactly you're r=
unning postgres in your tests? A specific AWS service? What's the test i=
nfrastructure that sits above the file system?
>> >>=20
>> >> On Thu, Jul 17, 2025 at 11:59=E2=80=AFPM Pierre Barre <pierre@barr=
e.sh> wrote:
>> >>> Hi everyone,
>> >>>=20
>> >>> I wanted to share a project I've been working on that enables Pos=
tgreSQL to run on S3 storage while maintaining performance comparable to=
 local NVMe. The approach uses block-level access rather than trying to =
map filesystem operations to S3 objects.
>> >>>=20
>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>> >>>=20
>> >>> # The Architecture
>> >>>=20
>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3=
 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools b=
uilt on these block devices:
>> >>>=20
>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>> >>>=20
>> >>> By providing block-level access and leveraging ZFS's caching capa=
bilities (L2ARC), we can achieve microsecond latencies despite the under=
lying storage being in S3.
>> >>>=20
>> >>> ## Performance Results
>> >>>=20
>> >>> Here are pgbench results from PostgreSQL running on this setup:
>> >>>=20
>> >>> ### Read/Write Workload
>> >>>=20
>> >>> ```
>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 =
example
>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> >>> starting vacuum...end.
>> >>> transaction type: <builtin: TPC-B (sort of)>
>> >>> scaling factor: 50
>> >>> query mode: simple
>> >>> number of clients: 50
>> >>> number of threads: 15
>> >>> maximum number of tries: 1
>> >>> number of transactions per client: 100000
>> >>> number of transactions actually processed: 5000000/5000000
>> >>> number of failed transactions: 0 (0.000%)
>> >>> latency average =3D 0.943 ms
>> >>> initial connection time =3D 48.043 ms
>> >>> tps =3D 53041.006947 (without initial connection time)
>> >>> ```
>> >>>=20
>> >>> ### Read-Only Workload
>> >>>=20
>> >>> ```
>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 =
-S example
>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> >>> starting vacuum...end.
>> >>> transaction type: <builtin: select only>
>> >>> scaling factor: 50
>> >>> query mode: simple
>> >>> number of clients: 50
>> >>> number of threads: 15
>> >>> maximum number of tries: 1
>> >>> number of transactions per client: 100000
>> >>> number of transactions actually processed: 5000000/5000000
>> >>> number of failed transactions: 0 (0.000%)
>> >>> latency average =3D 0.121 ms
>> >>> initial connection time =3D 53.358 ms
>> >>> tps =3D 413436.248089 (without initial connection time)
>> >>> ```
>> >>>=20
>> >>> These numbers are with 50 concurrent clients and the actual data =
stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory cach=
es, while cold data comes from S3.
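
As a quick consistency check on the two runs above, the reported average
latencies can be compared with what Little's law implies (clients in
flight divided by throughput). This is only a sketch of the arithmetic;
mean_latency_ms is a hypothetical helper name, and the inputs are the
client count and tps figures copied from the pgbench output.

```python
def mean_latency_ms(clients, tps):
    """Average latency in ms implied by Little's law."""
    return 1000.0 * clients / tps


# Figures copied from the two pgbench runs quoted above.
read_write = mean_latency_ms(50, 53041.006947)
read_only = mean_latency_ms(50, 413436.248089)
print(f"read/write: {read_write:.3f} ms")
print(f"read-only:  {read_only:.3f} ms")
```

Both values round to the 0.943 ms and 0.121 ms that pgbench printed, so
the latency and tps lines are consistent with 50 busy clients.
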
>> >>>=20
>> >>> ## How It Works
>> >>>=20
>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/Z=
FS can use like any other block device
>> >>> 2. Multiple cache layers hide S3 latency:
>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>> >>>    b. ZeroFS memory cache for metadata and hot data
>> >>>    c. Optional local disk cache
>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS' L=
SM-tree
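
The 128KB chunking in step 4 can be illustrated with a small sketch. It
only shows how a byte range maps onto fixed-size chunks; chunk_spans is a
hypothetical helper, and how ZeroFS actually keys chunks in its LSM-tree
is not shown here.

```python
CHUNK_SIZE = 128 * 1024  # 128KB, the chunk size stated above


def chunk_spans(offset, length):
    """Split a byte range into (chunk_index, start_in_chunk, size)."""
    spans = []
    end = offset + length
    while offset < end:
        index, start = divmod(offset, CHUNK_SIZE)
        size = min(CHUNK_SIZE - start, end - offset)
        spans.append((index, start, size))
        offset += size
    return spans


# An 8KB range that straddles the first chunk boundary.
print(chunk_spans(CHUNK_SIZE - 4096, 8192))
```

A range that crosses a chunk boundary touches two chunks, so aligned
I/O keeps each request within a single chunk.
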
>> >>>=20
>> >>> ## Geo-Distributed PostgreSQL
>> >>>=20
>> >>> Since each region can run its own ZeroFS instance, you can create=
 geographically distributed PostgreSQL setups.
>> >>>=20
>> >>> Example architectures:
>> >>>=20
>> >>> Architecture 1:
>> >>>=20
>> >>>=20
>> >>>                          PostgreSQL Client
>> >>>                                    |
>> >>>                                    | SQL queries
>> >>>                                    |
>> >>>                             +--------------+
>> >>>                             |  PG Proxy    |
>> >>>                             | (HAProxy/    |
>> >>>                             |  PgBouncer)  |
>> >>>                             +--------------+
>> >>>                                /        \
>> >>>                               /          \
>> >>>                    Synchronous            Synchronous
>> >>>                    Replication            Replication
>> >>>                             /              \
>> >>>                            /                \
>> >>>               +---------------+        +---------------+
>> >>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>> >>>               | (Primary)     |=E2=97=84------=E2=96=BA| (Standby=
)     |
>> >>>               +---------------+        +---------------+
>> >>>                       |                        |
>> >>>                       |  POSIX filesystem ops  |
>> >>>                       |                        |
>> >>>               +---------------+        +---------------+
>> >>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>> >>>               | (3-way mirror)|        | (3-way mirror)|
>> >>>               +---------------+        +---------------+
>> >>>                /      |      \          /      |      \
>> >>>               /       |       \        /       |       \
>> >>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10=
814
>> >>>              |        |        |           |        |        |
>> >>>         +--------++--------++--------++--------++--------++------=
--+
>> >>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS=
 6|
>> >>>         +--------++--------++--------++--------++--------++------=
--+
>> >>>              |         |         |         |         |         |
>> >>>              |         |         |         |         |         |
>> >>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3=
-Region6
>> >>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-e=
ast)
>> >>>=20
>> >>> Architecture 2:
>> >>>=20
>> >>> PostgreSQL Primary (Region 1) =E2=86=90=E2=86=92 PostgreSQL Stand=
by (Region 2)
>> >>>                 \                    /
>> >>>                  \                  /
>> >>>                   Same ZFS Pool (NBD)
>> >>>                          |
>> >>>                   6 Global ZeroFS
>> >>>                          |
>> >>>                       S3 Regions
>> >>>=20
>> >>>=20
>> >>> The main advantages I see are:
>> >>> 1. Dramatic cost reduction for large datasets
>> >>> 2. Simplified geo-distribution
>> >>> 3. Effectively unlimited storage capacity
>> >>> 4. Built-in encryption and compression
>> >>>=20
>> >>> Looking forward to your feedback and questions!
>> >>>=20
>> >>> Best,
>> >>> Pierre
>> >>>=20
>> >>> P.S. The full project includes a custom NFS filesystem too.
>> >>>=20
>> >
>>=20

--6c0ad1abfee34a82af0a664aaa60efcc--