Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uckFS-00GZs9-3I for pgsql-general@arkaria.postgresql.org; Fri, 18 Jul 2025 12:31:34 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uckFP-00BvtH-2v for pgsql-general@arkaria.postgresql.org; Fri, 18 Jul 2025 12:31:31 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uckFO-00Bvt8-8a for pgsql-general@lists.postgresql.org; Fri, 18 Jul 2025 12:31:31 +0000 Received: from fhigh-b5-smtp.messagingengine.com ([202.12.124.156]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1uckFL-007wFM-0m for pgsql-general@lists.postgresql.org; Fri, 18 Jul 2025 12:31:29 +0000 Received: from phl-compute-01.internal (phl-compute-01.phl.internal [10.202.2.41]) by mailfhigh.stl.internal (Postfix) with ESMTP id 1BF7C7A006A for ; Fri, 18 Jul 2025 08:31:26 -0400 (EDT) Received: from phl-imap-04 ([10.202.2.82]) by phl-compute-01.internal (MEProxy); Fri, 18 Jul 2025 08:31:26 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=barre.sh; h=cc :cc:content-transfer-encoding:content-type:content-type:date :date:from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to; s=fm1; t=1752841885; x= 1752928285; bh=aYii8iHQqQWhCUgLBEBAWgnyfJ92weyqiLdIfo7YLts=; b=n yCZGrXnqpQNPrfq92w/9Glh2C1kebBB4ZBYWxNWAp8QzKcT2efZfUpdAl+Ed68iF VoaQBoliyJODuDkrQLGZvyK+tIyww8o7hYOnkT2AI2wNoylhN/p1R/vVe4K9C5yx oexuMcWZvE7jcG0B43y7SObIyKHbkZ/OD3b0XYNOumers92GNRtp71G61r3OISSC MaXAvGbgHG0C8EmEfUa4jSkmUrNlwr+NaG0Yi1TpvDi6r22hcqcUb7YduGJ/5aBi l2+jdGR+3RmhN/B5sUVCXBKFyy+YX5JrnkRoAkiu9iluO7ByobiPpeDzikRAzXFp L+G5m5tMEch2DkycV29tA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-transfer-encoding :content-type:content-type:date:date:feedback-id:feedback-id :from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:x-me-proxy:x-me-sender :x-me-sender:x-sasl-enc; s=fm2; t=1752841885; x=1752928285; bh=a Yii8iHQqQWhCUgLBEBAWgnyfJ92weyqiLdIfo7YLts=; b=jHrdN/1iTfTjLllWi vv6NwBHFEHtv56fwf8eUIHPj5NbHdsplcvEGzHnAVHkyBMpgKI0jfLS1Qr+bmV1f GEm5so0MTj+0RE8aiaVm7vtxF2FEKjOb6zc9+hWtD1nMmj8lqnNAtc/6I5NJIPPR zAXGi8+H3nSF5FSWUjrQ9+gQvTqlzCxZH76QJgQiY4Bl/Q/BGeql/jFeDzvscmh4 cpP78OGLN/6c1bKDE7bBXSxtIW0r1yD5Shr06n2WcM1ErtN9y4TzV1CpzoRNE7bm 7EpcszKREsuq7p4QGeOkg4XUhhYDTkcanp4yOabVCGpFB7RLif0iarzGJ1rUxRdz nHrNQ== X-ME-Sender: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgdeifeegiecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecumhhishhsihhnghcuvffquchfihgvlhguucdlfedtmdenuc fjughrpefoggffhfevkfgjfhfutgfgsehtqhertdertdejnecuhfhrohhmpedfrfhivghr rhgvuceurghrrhgvfdcuoehpihgvrhhrvgessggrrhhrvgdrshhhqeenucggtffrrghtth gvrhhnpeffueetheegueeuleehueeuudfhtdeijeejuefhudffffeuuedutdfggffhvdet geenucffohhmrghinhepghhithhhuhgsrdgtohhmnecuvehluhhsthgvrhfuihiivgeptd enucfrrghrrghmpehmrghilhhfrhhomhepphhivghrrhgvsegsrghrrhgvrdhshhdpnhgs pghrtghpthhtohepuddpmhhouggvpehsmhhtphhouhhtpdhrtghpthhtohepphhgshhqlh dqghgvnhgvrhgrlheslhhishhtshdrphhoshhtghhrvghsqhhlrdhorhhg X-ME-Proxy: Feedback-ID: i97614980:Fastmail Received: by mailuser.phl.internal (Postfix, from userid 501) id 5E846B6006B; Fri, 18 Jul 2025 08:31:25 -0400 (EDT) X-Mailer: MessagingEngine.com Webmail Interface MIME-Version: 1.0 X-ThreadId: T89c86ea8eb4c36ce Date: Fri, 18 Jul 2025 14:31:05 +0200 From: "Pierre Barre" Cc: pgsql-general@lists.postgresql.org Message-Id: In-Reply-To: <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com> References: <8188513c-e089-4273-b2be-16dd0a5a0a80@app.fastmail.com> Subject: Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Now, I'm trying to understand how CAP theorem applies here. Traditional = PostgreSQL replication has clear CAP trade-offs - you choose between con= sistency and availability during partitions. But when PostgreSQL instances share storage rather than replicate: - Consistency seems maintained (same data) - Availability seems maintained (client can always promote an accessible= node) - Partitions between PostgreSQL nodes don't prevent the system from func= tioning It seems that CAP assumes specific implementation details (like nodes ma= intaining independent state) without explicitly stating them. How should we think about CAP theorem when distributed nodes share stora= ge rather than coordinate state? Are the trade-offs simply moved to a di= fferent layer, or does shared storage fundamentally change the analysis? Client with awareness of both PostgreSQL nodes | | =E2=86=93 (partition here) =E2=86=93 PostgreSQL Primary PostgreSQL Standby | | =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=98 =E2=86=93 Shared ZFS Pool | 6 Global ZeroFS instances Best, Pierre On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote: > Hi Seref, >=20 > For the benchmarks, I used Hetzner's cloud service with the following = setup: >=20 > - A Hetzner s3 bucket in the FSN1 region > - A virtual machine of type ccx63 48 vCPU 192 GB memory > - 3 ZeroFS nbd devices (same s3 bucket) > - A ZFS stripped pool with the 3 devices > - 200GB zfs L2ARC > - Postgres configured accordingly memory-wise as well as with synchron= ous_commit =3D off, wal_init_zero =3D off and wal_recycle =3D off. >=20 > Best, > Pierre >=20 > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote: >> Sorry, this was meant to go to the whole group: >>=20 >> Very interesting!. Great work. Can you clarify how exactly you're run= ning postgres in your tests? A specific AWS service? What's the test inf= rastructure that sits above the file system? >>=20 >> On Thu, Jul 17, 2025 at 11:59=E2=80=AFPM Pierre Barre wrote: >>> Hi everyone, >>>=20 >>> I wanted to share a project I've been working on that enables Postgr= eSQL to run on S3 storage while maintaining performance comparable to lo= cal NVMe. The approach uses block-level access rather than trying to map= filesystem operations to S3 objects. >>>=20 >>> ZeroFS: https://github.com/Barre/ZeroFS >>>=20 >>> # The Architecture >>>=20 >>> ZeroFS provides NBD (Network Block Device) servers that expose S3 st= orage as raw block devices. PostgreSQL runs unmodified on ZFS pools buil= t on these block devices: >>>=20 >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3 >>>=20 >>> By providing block-level access and leveraging ZFS's caching capabil= ities (L2ARC), we can achieve microsecond latencies despite the underlyi= ng storage being in S3. >>>=20 >>> ## Performance Results >>>=20 >>> Here are pgbench results from PostgreSQL running on this setup: >>>=20 >>> ### Read/Write Workload >>>=20 >>> ``` >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 exa= mple >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1)) >>> starting vacuum...end. >>> transaction type: >>> scaling factor: 50 >>> query mode: simple >>> number of clients: 50 >>> number of threads: 15 >>> maximum number of tries: 1 >>> number of transactions per client: 100000 >>> number of transactions actually processed: 5000000/5000000 >>> number of failed transactions: 0 (0.000%) >>> latency average =3D 0.943 ms >>> initial connection time =3D 48.043 ms >>> tps =3D 53041.006947 (without initial connection time) >>> ``` >>>=20 >>> ### Read-Only Workload >>>=20 >>> ``` >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S = example >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1)) >>> starting vacuum...end. >>> transaction type: >>> scaling factor: 50 >>> query mode: simple >>> number of clients: 50 >>> number of threads: 15 >>> maximum number of tries: 1 >>> number of transactions per client: 100000 >>> number of transactions actually processed: 5000000/5000000 >>> number of failed transactions: 0 (0.000%) >>> latency average =3D 0.121 ms >>> initial connection time =3D 53.358 ms >>> tps =3D 413436.248089 (without initial connection time) >>> ``` >>>=20 >>> These numbers are with 50 concurrent clients and the actual data sto= red in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches,= while cold data comes from S3. >>>=20 >>> ## How It Works >>>=20 >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS = can use like any other block device >>> 2. Multiple cache layers hide S3 latency: >>> a. ZFS ARC/L2ARC for frequently accessed blocks >>> b. ZeroFS memory cache for metadata and hot dataZeroFS exposes NB= D devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other b= lock device >>> c. Optional local disk cache >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3 >>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-= tree >>>=20 >>> ## Geo-Distributed PostgreSQL >>>=20 >>> Since each region can run its own ZeroFS instance, you can create ge= ographically distributed PostgreSQL setups. >>>=20 >>> Example architectures: >>>=20 >>> Architecture 1 >>>=20 >>>=20 >>> PostgreSQL Client >>> | >>> | SQL queries >>> | >>> +--------------+ >>> | PG Proxy | >>> | (HAProxy/ | >>> | PgBouncer) | >>> +--------------+ >>> / \ >>> / \ >>> Synchronous Synchronous >>> Replication Replication >>> / \ >>> / \ >>> +---------------+ +---------------+ >>> | PostgreSQL 1 | | PostgreSQL 2 | >>> | (Primary) |=E2=97=84------=E2=96=BA| (Standby) = | >>> +---------------+ +---------------+ >>> | | >>> | POSIX filesystem ops | >>> | | >>> +---------------+ +---------------+ >>> | ZFS Pool 1 | | ZFS Pool 2 | >>> | (3-way mirror)| | (3-way mirror)| >>> +---------------+ +---------------+ >>> / | \ / | \ >>> / | \ / | \ >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814 >>> | | | | | | >>> +--------++--------++--------++--------++--------++--------+ >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6| >>> +--------++--------++--------++--------++--------++--------+ >>> | | | | | | >>> | | | | | | >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Re= gion6 >>> (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east) >>>=20 >>> Architecture 2: >>>=20 >>> PostgreSQL Primary (Region 1) =E2=86=90=E2=86=92 PostgreSQL Standby = (Region 2) >>> \ / >>> \ / >>> Same ZFS Pool (NBD) >>> | >>> 6 Global ZeroFS >>> | >>> S3 Regions >>>=20 >>>=20 >>> The main advantages I see are: >>> 1. Dramatic cost reduction for large datasets >>> 2. Simplified geo-distribution >>> 3. Infinite storage capacity >>> 4. Built-in encryption and compression >>>=20 >>> Looking forward to your feedback and questions! >>>=20 >>> Best, >>> Pierre >>>=20 >>> P.S. The full project includes a custom NFS filesystem too. >>>=20 >=20