Date: Fri, 18 Jul 2025 00:57:47 +0200
From: "Pierre Barre"
To: pgsql-general@lists.postgresql.org
Subject: PostgreSQL on S3-backed Block Storage with Near-Local Performance

Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve sub-millisecond latencies for cached data despite the underlying storage being in S3.

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type:
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type:
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS's LSM-tree

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.

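As a side note on step 4 under "How It Works": the chunking can be illustrated with a little arithmetic. The sketch below is not ZeroFS code. It is a minimal Python illustration, assuming fixed 128KB chunks addressed by index, of which chunks a given byte range would touch. The function name `chunks_for_range` and the tuple layout are hypothetical.

```python
CHUNK_SIZE = 128 * 1024  # 128KB, the chunk size stated in step 4


def chunks_for_range(offset: int, length: int, chunk_size: int = CHUNK_SIZE):
    """List the chunks touched by the byte range [offset, offset + length).

    Returns (chunk_index, start_in_chunk, end_in_chunk) tuples, where the
    end is exclusive. Assumption: chunks are fixed-size and numbered from 0.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")

    spans = []
    pos = offset
    end = offset + length
    while pos < end:
        index = pos // chunk_size            # which chunk this byte falls in
        start = pos % chunk_size             # position inside that chunk
        stop = min(chunk_size, start + (end - pos))
        spans.append((index, start, stop))
        pos += stop - start                  # advance to the next chunk
    return spans


if __name__ == "__main__":
    # An 8KB request (PostgreSQL's default page size) that starts 4KB
    # before the first chunk boundary.
    for span in chunks_for_range(CHUNK_SIZE - 4096, 8192):
        print(span)
```

Under these assumptions, an 8KB request that starts 4KB before a chunk boundary touches two chunks, while an aligned 8KB request touches only one.
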
Example architectures:

Architecture 1:

                     PostgreSQL Client
                             |
                             | SQL queries
                             |
                      +--------------+
                      |  PG Proxy    |
                      | (HAProxy/    |
                      |  PgBouncer)  |
                      +--------------+
                         /        \
                        /          \
            Synchronous              Synchronous
            Replication              Replication
                     /                \
                    /                  \
         +---------------+        +---------------+
         | PostgreSQL 1  |        | PostgreSQL 2  |
         | (Primary)     |◄------►| (Standby)     |
         +---------------+        +---------------+
                 |                        |
                 |  POSIX filesystem ops  |
                 |                        |
         +---------------+        +---------------+
         |  ZFS Pool 1   |        |  ZFS Pool 2   |
         | (3-way mirror)|        | (3-way mirror)|
         +---------------+        +---------------+
           /     |     \            /     |     \
      /        |        \          /        |        \
NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
    |         |         |         |         |         |
+--------++--------++--------++--------++--------++--------+
|ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
+--------++--------++--------++--------++--------++--------+
    |         |         |         |         |         |
    |         |         |         |         |         |
S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
(us-east)  (eu-west)  (ap-south) (us-west)  (eu-north) (ap-east)

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
                \                    /
                 \                  /
                  Same ZFS Pool (NBD)
                          |
                   6 Global ZeroFS
                          |
                      S3 Regions

The main advantages I see are:

1. Dramatic cost reduction for large datasets
2. Simplified geo-distribution
3. Effectively unlimited storage capacity
4. Built-in encryption and compression

Looking forward to your feedback and questions!

Best,
Pierre

P.S. The full project includes a custom NFS filesystem too.