From: Scott Ribe
To: pgsql-admin
Subject: regarding PG on ZFS performance
Date: Tue, 12 Apr 2022 10:18:09 -0600
Message-Id: <46F40448-CB5C-4D86-806D-27CF791A364F@elevated-dev.com>

Just re-ran some older tests with the current version of Ubuntu & ZFS
(but an older kernel, thanks to a multi-way incompatibility with other
things). Results are that with proper tuning, ZFS RAIDZ1 on 4 NVMe
drives gives higher TPS on pgbench at scale 10,000 than XFS on one of
the same NVMe drives--but the initial population of the db takes 25%
longer.

Proper tuning: PG full_page_writes off (for ZFS, on for NVMe); ZFS lz4
compression, 64K recordsize, relatime

db created by: pgbench -i -s 10000 --foreign-keys test
benchmarked as: pgbench -c 100 -j 4 -t 1000 test

NVMe:   31,804 TPS
RAIDZ1: 50,228 TPS

Some other notes:

- the situation is reversed, single NVMe is faster, when using 10
  connections instead of 100

- these tests are all from within containers running on Kubernetes--pg
  server and client in same container, connected over domain sockets

- 256GB and 48 CPU pod limits--running where there's still the cgroup
  double-counting bug, so CPU is theoretically throttled to ~24, leaving
  ~20 to PG server

- the container is actually getting very slightly throttled at barely
  over 20 CPU--so not sure if it's CPU-bound or IO-bound

- PG settings are set up for a larger database: shared_buffers,
  work_mem, parallel workers, autovacuum, etc.

- I'd read that because of the way ZFS handles RAIDZ1 compared to RAID5,
  performance probably didn't suffer relative to RAID10, and this is
  the case--tests with ZFS RAID10 on the same drives were a tiny bit
  slower (2-3%) than RAIDZ1 for TPS, but a bit faster on initial
  population (6-8%)

- as an aside, WekaFS
  (https://www.aspsys.com/solutions/storage-solutions/weka-io/) is about
  10% faster than RAIDZ1 (both TPS and initial fill)

I hope that experience from someone who actually bothered to read up on
how to configure ZFS for PG can put to rest some "ZFS is too slow"
misinformation.
I am certain that ZFS is not the fastest for every configuration (for
instance, I am unable to configure the 4 NVMe drives into a hardware
RAID10 to test, and it seems that ZFS may not scale well to larger
numbers of disks) but "too slow to ever be considered for serious work"
is flat-out wrong.

-- 
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/
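
For anyone who wants to put a number on the headline result, here is a
minimal sketch of the arithmetic, using only the two TPS figures quoted
in the message (pgbench -c 100 -j 4 -t 1000). The names `tps` and
`percent_faster` are made up for illustration; nothing here is newly
measured, and it says nothing about the 10-connection case, where the
message reports that the single NVMe comes out ahead.

```python
# TPS figures copied from the message above (100 connections, scale 10,000).
tps = {
    "nvme_xfs_single": 31_804,   # XFS on one NVMe drive
    "zfs_raidz1_4x": 50_228,     # ZFS RAIDZ1 across 4 NVMe drives
}


def percent_faster(candidate: float, baseline: float) -> float:
    """Return how much higher `candidate` is than `baseline`, in percent."""
    return (candidate / baseline - 1.0) * 100.0


gain = percent_faster(tps["zfs_raidz1_4x"], tps["nvme_xfs_single"])
print(f"RAIDZ1 vs single NVMe: {gain:.1f}% higher TPS")
# prints "RAIDZ1 vs single NVMe: 57.9% higher TPS"
```

So at 100 connections the tuned RAIDZ1 setup delivers roughly 58% more
TPS than the single drive, against an initial population that takes 25%
longer.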