From: Marcel Menzel
To: Thomas Munro
Cc: pgsql-general@lists.postgresql.org
Subject: Re: pg_upgrade reflink support on OpenZFS
Date: Sat, 15 Nov 2025 17:16:56 +0100
Message-ID: <5fd60425-db26-4700-b716-5be3762acd33@menzel.de>

On 15/11/2025 05:17, Thomas Munro wrote:
> On Sat, Nov 15, 2025 at 7:16 AM Marcel Menzel wrote:
>> For the PostgreSQL upgrade to version 18, I took the opportunity to test
>> the reflink support in pg_upgrade (with --clone) on OpenZFS 2.3.4 /
>> Linux 6.15.11 and it worked flawlessly, being a huge time saver here.
>
> Nice!
>
>> I've looked into the documentation for pg_upgrade and it's only
>> mentioning btrfs and XFS on Linux and not FreeBSD at all, so I thought
>> It'd be an interesting heads-up to report that Linux gained a 3rd FS and
>> also I think FreeBSD in general the ability for doing reflink copies.
>
> It does mention both Linux and FreeBSD under --copy-file-range.
> I didn't try to list all the relevant file systems there though, partly
> because I didn't feel like documenting all the quirks (only works if
> you created your XFS file system with the feature enabled, might need
> to frobnicate ZFS sysctl, which NFS clients and servers can push it
> down, likewise for non-COW file systems and device drivers, etc etc).
> It might be nice to find a decent reference for all that stuff
> somewhere else and point to it, but I don't think we can maintain that
> accurately ourselves.
>
> I was actually surprised to hear that ioctl(dest_fd, FICLONE, src_fd)
> worked for you. I knew that it was really BTRFS's ioctl and XFS
> accepted it too, but I didn't know that ZFS also understood it[1] in
> 2.3. They apparently didn't really expect anyone to call it, and
> since ZFS 2.4 is apparently about to ship without it[2], it seems like
> a bad time to add it to the documentation for --clone.

Oh, I haven't looked at the upcoming versions yet, but yes, in that case
it doesn't make sense to mention it.

>> OpenZFS has been supporting this since 2.2 but has had it disabled due
>> to data corruption bugs, now since 2.3 the sysctl (zfs_bclone_enabled on
>> Linux, vfs.zfs.bclone_enabled on FreeBSD) has been enabled by default so
>> only the zpool feature "block_cloning" has to be enabled, which might be
>> the case when running "zpool upgrade".
>
> Yeah, those data corruption reports (which turned out to be
> misattributed IIRC?) provided one reason to keep the old BTRFS ioctl()
> under --clone but add the new behaviour under --copy-file-range.
> --copy-file-range should work for all COW filesystems on Linux via
> proper VFS entrypoints, and is the official way to do this from user
> space. Perhaps we should eventually harmonise this under a single
> option and drop the ioctl() stuff.
> One semantic change would be that
> copy_file_range() means "copy with your best trick" (could be cloning,
> network/driver pushdown or user space buffer copy, silently selecting
> the behaviour), while the BTRFS ioctl() means "clone or fail" IIRC, so
> that was another reason to want a separate option for now.

I haven't looked closely yet at the copy_file_range() syscall and how
tools interact with it, but I found an interesting GitHub comment[3]
that gives me a clearer picture now. It's totally understandable that
OpenZFS is removing the compat for those BTRFS ioctls now that they have
a proper replacement. Peeking at the OpenZFS docs[4][5], they also
mention that the copy_file_range() syscall invokes the BRT (Block
Reference Table), so I guess I'll use pg_upgrade with --copy-file-range
next time.

> For reference, the macOS copyfile() call used for --clone has flags
> that should cause it to fail if it can't clone IIUC, while the Windows
> CopyFile() call used for --copy might even clone blocks on ReFS even
> if you don't specify --clone... huh.
>
>> I haven't had the possibility to check this on FreeBSD yet, but I don't
>> see why this should not work as I also can't spot anything in the
>> OpenZFS docs regarding reflink / block cloning limitations on FreeBSD.
>> Also I saw one of the OpenZFS devs writing on Reddit about block cloning
>> being supported on FreeBSD v14.
>
> It always succeeds on FreeBSD, but it only actually clones if you set
> vfs.zfs.bclone_enabled=1. I've tested all our "clone" features with
> that and they work nicely. The sysctl wasn't on by default in FreeBSD
> 14.x, but 15 is about to ship and the "experimental" label was removed
> in man 4 zfs.
>
> If you haven't seen them yet, you might also like these COW tricks:
>
> Shared storage of basic catalog tables when you have a lot of databases:
>   SET file_copy_method = CLONE;
>   CREATE DATABASE ... STRATEGY=FILE_COPY;
>
> Fast database clone/snapshot of very large databases (caveats: users
> can't be connected to source, checkpoint forced):
>   SET file_copy_method = CLONE;
>   CREATE DATABASE ... STRATEGY=FILE_COPY TEMPLATE=source_db;
>
> Combine a chain of incremental backups and a full backup to produce a
> new full backup, sharing disk blocks with the ancestor backups:
>   pg_combinebackup --copy-file-range
>
> That last one is a really powerful use of copy_file_range()'s subfile
> cloning powers. Another subfile cloning trick I've proposed before is
> making relation segment size user-controllable, and then allowing
> pg_upgrade to migrate between segment sizes by splicing them together.

Oh, those are really handy commands, especially the last one. Many
thanks for pointing them out!

> [1] https://github.com/openzfs/zfs/commit/9927f219f1e9f4ee886d426190500abf5b1d602e
> [2] https://github.com/openzfs/zfs/commit/4800181b3b950d67a62aca7c9e28d34c8b303242

[3] https://github.com/openzfs/zfs/pull/13392#issuecomment-1742172842
[4] https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#block_cloning
[5] https://openzfs.github.io/openzfs-docs/man/master/7/zfsconcepts.7.html#Block_cloning
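
To make the two semantics from the thread a bit more concrete ("clone or fail" for the FICLONE ioctl versus "copy with your best trick" for copy_file_range()), here is a rough Python sketch for Linux. It is only an illustration and not how pg_upgrade implements --clone or --copy-file-range. The function names are made up, the FICLONE constant is assumed to be the <linux/fs.h> value for x86-64/arm64, and the final read/write fallback is an addition so that the sketch also runs on file systems without any cloning support.

```python
import fcntl
import os

# Assumption: _IOW(0x94, 9, int) from <linux/fs.h>, as on x86-64 and arm64.
FICLONE = 0x40049409


def clone_or_fail(src_fd, dst_fd):
    """'Clone or fail': share blocks via ioctl(FICLONE) or raise OSError."""
    fcntl.ioctl(dst_fd, FICLONE, src_fd)


def copy_best_trick(src_fd, dst_fd, length):
    """'Copy with your best trick': the kernel silently picks clone,
    pushdown, or a plain copy; the caller cannot tell which one ran."""
    while length > 0:
        copied = os.copy_file_range(src_fd, dst_fd, length)
        if copied == 0:
            raise OSError("unexpected end of file")
        length -= copied


def copy_file(src_path, dst_path):
    """Hypothetical helper: try the strictest method first and return
    the name of the method that succeeded."""
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        size = os.fstat(src_fd).st_size
        try:
            clone_or_fail(src_fd, dst_fd)
            return "clone"
        except OSError:
            pass  # no reflink support here (e.g. EOPNOTSUPP, EXDEV, ENOTTY)
        try:
            copy_best_trick(src_fd, dst_fd, size)
            return "copy_file_range"
        except (OSError, AttributeError):
            pass  # old kernel, restricted sandbox, or non-Linux platform
        # Last resort: user-space buffer copy, restarted from offset 0.
        os.lseek(src_fd, 0, os.SEEK_SET)
        os.lseek(dst_fd, 0, os.SEEK_SET)
        os.ftruncate(dst_fd, 0)
        while chunk := os.read(src_fd, 1 << 20):
            view = memoryview(chunk)
            while view:  # os.write() may write fewer bytes than requested
                view = view[os.write(dst_fd, view):]
        return "read/write"
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```

On a pool with block_cloning enabled one would expect copy_file() to return "clone" or "copy_file_range"; note that a "copy_file_range" result says nothing about whether blocks were actually shared, which is exactly the semantic difference described above.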