Feedback-ID: i97614980:Fastmail
MIME-Version: 1.0
Date: Sat, 26 Jul 2025 09:51:15 +0200
From: "Pierre Barre" <pierre@barre.sh>
To: "Vladimir Churyukin" <vladimir@churyukin.com>
Cc: pgsql-general@lists.postgresql.org
Message-Id: <44dafe90-9ad6-41ae-b9fe-bea4aaf49a59@app.fastmail.com>
In-Reply-To: 
 <CAFSGpE2xzAz4zefZa8sQLkNajp0hT7LiONQDGSAxigwGG3ii8w@mail.gmail.com>
References: <a9fe5ddb-9685-4139-bc1f-88161a7a4da3@app.fastmail.com>
 <CAG1bHGOzCNtDeW0W8gRO7mpW=t7BqWh-iz4kX5VRCPgt_6Tr6Q@mail.gmail.com>
 <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com>
 <c5a52444-80cd-4e50-8fc4-a3a9bc09feb4@app.fastmail.com>
 <CAFSGpE2j29C0MntAe59oay0sm1W_htQFyAbfuJeEViJ0BN1Wyg@mail.gmail.com>
 <96edd171-9cbe-466d-b3d6-04e069cee419@app.fastmail.com>
 <CAFSGpE2xzAz4zefZa8sQLkNajp0hT7LiONQDGSAxigwGG3ii8w@mail.gmail.com>
Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Content-Type: multipart/alternative;
 boundary=d5aa1bab362c4f96945cfcf1af96a23b
Archived-At: <https://www.postgresql.org/message-id/44dafe90-9ad6-41ae-b9fe-bea4aaf49a59%40app.fastmail.com>
Precedence: bulk

--d5aa1bab362c4f96945cfcf1af96a23b
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Ah, by "shared storage" I mean that each node can acquire exclusive acce=
ss to it, not that both can read and write to it at the same time.

> Some pretty well-known cases of storage / compute separation (Aurora, =
Neon) also share the storage between instances,

That model is cool, but I think it's more of a solution for outliers as =
I was suggesting, not something that most would or should want.

Best,
Pierre

On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
> Sorry, I was referring to this:
>=20
> >  But when PostgreSQL instances share storage rather than replicate:
> > - Consistency seems maintained (same data)
> > - Availability seems maintained (client can always promote an access=
ible node)
> > - Partitions between PostgreSQL nodes don't prevent the system from =
functioning
>=20
> Some pretty well-known cases of storage / compute separation (Aurora, =
Neon) also share the storage between instances,
> that's why I'm a bit confused by your reply. I thought you were thinki=
ng about this approach too, which is why I mentioned the kind of challen=
ges one may have on that path.
>=20
>=20
> On Sat, Jul 26, 2025 at 12:36=E2=80=AFAM Pierre Barre <pierre@barre.sh=
> wrote:
>> What you describe doesn=E2=80=99t look like something very useful for=
 the vast majority of projects that need a database. Why would you even=
 want that if you can avoid it?=20
>>=20
>> If your =E2=80=9Csingle node=E2=80=9D can handle tens / hundreds of t=
housands of requests per second, still have very durable and highly avai=
lable storage, as well as fast recovery mechanisms, what=E2=80=99s the p=
oint?
>>=20
>> I am not trying to cater to extreme outliers that may want something =
very weird like this; those are just not the use cases I want to address=
, because I believe they are few and far between.
>>=20
>> Best,
>> Pierre=20
>>=20
>> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
>>> A shared storage would require a lot of extra work. That's essential=
ly what AWS Aurora does.
>>> You will have to have functionality to sync in-memory states between=
 nodes, because all the instances will have cached data that can easily =
become stale on any write operation.
>>> That alone is not that simple. You will have to modify some locking =
logic, and most likely make a lot of other changes in a lot of places: P=
ostgres was just not built with the assumption that the storage can be s=
hared.
>>>=20
>>> -Vladimir
>>>=20
>>> On Fri, Jul 18, 2025 at 5:31=E2=80=AFAM Pierre Barre <pierre@barre.s=
h> wrote:
>>>> Now, I'm trying to understand how the CAP theorem applies here. Trad=
itional PostgreSQL replication has clear CAP trade-offs: you choose betw=
een consistency and availability during partitions.
>>>>=20
>>>> But when PostgreSQL instances share storage rather than replicate:
>>>> - Consistency seems maintained (same data)
>>>> - Availability seems maintained (client can always promote an acces=
sible node)
>>>> - Partitions between PostgreSQL nodes don't prevent the system from=
 functioning
>>>>=20
>>>> It seems that CAP assumes specific implementation details (like nod=
es maintaining independent state) without explicitly stating them.
>>>>=20
>>>> How should we think about CAP theorem when distributed nodes share =
storage rather than coordinate state? Are the trade-offs simply moved to=
 a different layer, or does shared storage fundamentally change the anal=
ysis?
>>>>=20
>>>> Client with awareness of both PostgreSQL nodes
>>>>     |                               |
>>>>     =E2=86=93 (partition here)              =E2=86=93
>>>> PostgreSQL Primary              PostgreSQL Standby
>>>>     |                               |
>>>>     =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=
=E2=94=80=E2=94=98
>>>>                 =E2=86=93
>>>>          Shared ZFS Pool
>>>>                 |
>>>>          6 Global ZeroFS instances
>>>>=20
>>>> Best,
>>>> Pierre
>>>>=20
>>>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>>>> > Hi Seref,
>>>> >
>>>> > For the benchmarks, I used Hetzner's cloud service with the follo=
wing setup:
>>>> >
>>>> > - A Hetzner S3 bucket in the FSN1 region
>>>> > - A virtual machine of type ccx63 (48 vCPUs, 192 GB memory)
>>>> > - 3 ZeroFS NBD devices (same S3 bucket)
>>>> > - A ZFS striped pool with the 3 devices
>>>> > - 200 GB ZFS L2ARC
>>>> > - Postgres configured accordingly memory-wise as well as with syn=
chronous_commit =3D off, wal_init_zero =3D off and wal_recycle =3D off.
>>>> >
>>>> > Best,
>>>> > Pierre
>>>> >
>>>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>>> >> Sorry, this was meant to go to the whole group:
>>>> >>
>>>> >> Very interesting! Great work. Can you clarify how exactly you're =
running Postgres in your tests? A specific AWS service? What's the test =
infrastructure that sits above the file system?
>>>> >>
>>>> >> On Thu, Jul 17, 2025 at 11:59=E2=80=AFPM Pierre Barre <pierre@ba=
rre.sh> wrote:
>>>> >>> Hi everyone,
>>>> >>>
>>>> >>> I wanted to share a project I've been working on that enables P=
ostgreSQL to run on S3 storage while maintaining performance comparable =
to local NVMe. The approach uses block-level access rather than trying t=
o map filesystem operations to S3 objects.
>>>> >>>
>>>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>>>> >>>
>>>> >>> # The Architecture
>>>> >>>
>>>> >>> ZeroFS provides NBD (Network Block Device) servers that expose =
S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools=
 built on these block devices:
>>>> >>>
>>>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>> >>>
>>>> >>> By providing block-level access and leveraging ZFS's caching ca=
pabilities (L2ARC), we can achieve microsecond latencies despite the und=
erlying storage being in S3.
>>>> >>>
>>>> >>> ## Performance Results
>>>> >>>
>>>> >>> Here are pgbench results from PostgreSQL running on this setup:
>>>> >>>
>>>> >>> ### Read/Write Workload
>>>> >>>
>>>> >>> ```
>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10000=
0 example
>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> >>> starting vacuum...end.
>>>> >>> transaction type: <builtin: TPC-B (sort of)>
>>>> >>> scaling factor: 50
>>>> >>> query mode: simple
>>>> >>> number of clients: 50
>>>> >>> number of threads: 15
>>>> >>> maximum number of tries: 1
>>>> >>> number of transactions per client: 100000
>>>> >>> number of transactions actually processed: 5000000/5000000
>>>> >>> number of failed transactions: 0 (0.000%)
>>>> >>> latency average =3D 0.943 ms
>>>> >>> initial connection time =3D 48.043 ms
>>>> >>> tps =3D 53041.006947 (without initial connection time)
>>>> >>> ```
>>>> >>>
>>>> >>> ### Read-Only Workload
>>>> >>>
>>>> >>> ```
>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10000=
0 -S example
>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> >>> starting vacuum...end.
>>>> >>> transaction type: <builtin: select only>
>>>> >>> scaling factor: 50
>>>> >>> query mode: simple
>>>> >>> number of clients: 50
>>>> >>> number of threads: 15
>>>> >>> maximum number of tries: 1
>>>> >>> number of transactions per client: 100000
>>>> >>> number of transactions actually processed: 5000000/5000000
>>>> >>> number of failed transactions: 0 (0.000%)
>>>> >>> latency average =3D 0.121 ms
>>>> >>> initial connection time =3D 53.358 ms
>>>> >>> tps =3D 413436.248089 (without initial connection time)
>>>> >>> ```
>>>> >>>
>>>> >>> These numbers are with 50 concurrent clients and the actual dat=
a stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory ca=
ches, while cold data comes from S3.
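
The latency averages above line up with the usual closed-loop
relation for this kind of benchmark: with a fixed number of clients,
average latency is roughly the client count divided by tps. A minimal
Python sketch of that arithmetic, using only the figures from the two
runs; the function name avg_latency_ms is made up for illustration
and is not part of pgbench.

```python
def avg_latency_ms(clients, tps):
    """Approximate average latency in ms for a closed-loop benchmark."""
    # Each client keeps one transaction in flight at a time.
    return clients / tps * 1000.0


# Client count and tps values are taken from the two runs above.
print(round(avg_latency_ms(50, 53041.006947), 3))   # read/write: 0.943
print(round(avg_latency_ms(50, 413436.248089), 3))  # read-only: 0.121
```

Both results agree with the reported latency averages, so the latency
and tps lines in each run are two views of the same measurement.
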
>>>> >>>
>>>> >>> ## How It Works
>>>> >>>
>>>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL=
/ZFS can use like any other block device
>>>> >>> 2. Multiple cache layers hide S3 latency:
>>>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>> >>>    b. ZeroFS memory cache for metadata and hot data
>>>> >>>    c. Optional local disk cache
>>>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS'=
 LSM-tree
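
Step 4 can be pictured as plain integer arithmetic: a byte offset in
a file maps to a chunk index plus an offset inside that chunk. A rough
Python sketch, assuming "128KB" means 128 KiB (131072 bytes); the
function name locate and the idea of addressing chunks by index are
illustrative assumptions, not taken from the ZeroFS code.

```python
def locate(offset, chunk_size):
    """Map a byte offset to (chunk index, offset within that chunk)."""
    return divmod(offset, chunk_size)


# 128 KiB chunks as assumed above; the offset value is arbitrary.
print(locate(300000, 128 * 1024))  # (2, 37856)
```

With fixed-size chunks, a small write only touches the chunks that
cover its byte range, which is the property this sketch is meant to
show.
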
>>>> >>>
>>>> >>> ## Geo-Distributed PostgreSQL
>>>> >>>
>>>> >>> Since each region can run its own ZeroFS instance, you can crea=
te geographically distributed PostgreSQL setups.
>>>> >>>
>>>> >>> Example architectures:
>>>> >>>
>>>> >>> Architecture 1:
>>>> >>>
>>>> >>>
>>>> >>>                          PostgreSQL Client
>>>> >>>                                    |
>>>> >>>                                    | SQL queries
>>>> >>>                                    |
>>>> >>>                             +--------------+
>>>> >>>                             |  PG Proxy    |
>>>> >>>                             | (HAProxy/    |
>>>> >>>                             |  PgBouncer)  |
>>>> >>>                             +--------------+
>>>> >>>                                /        \
>>>> >>>                               /          \
>>>> >>>                    Synchronous            Synchronous
>>>> >>>                    Replication            Replication
>>>> >>>                             /              \
>>>> >>>                            /                \
>>>> >>>               +---------------+        +---------------+
>>>> >>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>> >>>               | (Primary)     |=E2=97=84------=E2=96=BA| (Stand=
by)     |
>>>> >>>               +---------------+        +---------------+
>>>> >>>                       |                        |
>>>> >>>                       |  POSIX filesystem ops  |
>>>> >>>                       |                        |
>>>> >>>               +---------------+        +---------------+
>>>> >>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>> >>>               | (3-way mirror)|        | (3-way mirror)|
>>>> >>>               +---------------+        +---------------+
>>>> >>>                /      |      \          /      |      \
>>>> >>>               /       |       \        /       |       \
>>>> >>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:=
10814
>>>> >>>              |        |        |           |        |        |
>>>> >>>         +--------++--------++--------++--------++--------++----=
----+
>>>> >>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||Zero=
FS 6|
>>>> >>>         +--------++--------++--------++--------++--------++----=
----+
>>>> >>>              |         |         |         |         |         |
>>>> >>>              |         |         |         |         |         |
>>>> >>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 =
S3-Region6
>>>> >>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap=
-east)
>>>> >>>
>>>> >>> Architecture 2:
>>>> >>>
>>>> >>> PostgreSQL Primary (Region 1) =E2=86=90=E2=86=92 PostgreSQL Sta=
ndby (Region 2)
>>>> >>>                 \                    /
>>>> >>>                  \                  /
>>>> >>>                   Same ZFS Pool (NBD)
>>>> >>>                          |
>>>> >>>                   6 Global ZeroFS
>>>> >>>                          |
>>>> >>>                       S3 Regions
>>>> >>>
>>>> >>>
>>>> >>> The main advantages I see are:
>>>> >>> 1. Dramatic cost reduction for large datasets
>>>> >>> 2. Simplified geo-distribution
>>>> >>> 3. Effectively unlimited storage capacity
>>>> >>> 4. Built-in encryption and compression
>>>> >>>
>>>> >>> Looking forward to your feedback and questions!
>>>> >>>
>>>> >>> Best,
>>>> >>> Pierre
>>>> >>>
>>>> >>> P.S. The full project includes a custom NFS filesystem too.
>>>> >>>
>>>> >
>>>>=20
>>=20

--d5aa1bab362c4f96945cfcf1af96a23b--