Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v5hOI-00AUw9-35 for pgsql-hackers@arkaria.postgresql.org; Mon, 06 Oct 2025 09:20:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1v5hOF-006qx0-RE for pgsql-hackers@arkaria.postgresql.org; Mon, 06 Oct 2025 09:20:20 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v5hOF-006qws-Gi for pgsql-hackers@lists.postgresql.org; Mon, 06 Oct 2025 09:20:20 +0000 Received: from mail-lj1-x235.google.com ([2a00:1450:4864:20::235]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1v5hOD-000eN1-1H for pgsql-hackers@lists.postgresql.org; Mon, 06 Oct 2025 09:20:19 +0000 Received: by mail-lj1-x235.google.com with SMTP id 38308e7fff4ca-33c9f2bcdceso26434461fa.2 for ; Mon, 06 Oct 2025 02:20:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1759742410; x=1760347210; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=MMj6HUs9on7IIxm1BPTxwxkgCj3ewrW1rHWYxIGDWcc=; b=hCYs23Xcw1Y3aUZqdkmvRSkx8raMlV0ZrKkwZgd3gk1lNyf4iTqycrsI8VCCcfFud9 GA3CJJTEPXshDKmLFQc79fdrm0+EPmg4uD1dw5HnrwzJKLQq8y3Xn2ARrEgiRpqmn7Fx j5XmQmp5e0HYKRi68adl/Qm1+xnI7MUeL0WdAf+CBGhVDcvYJXLh72uLPMqC4Rn9xtNl F6UdTVxb1sxST7TtfD74Keh8yLfCwmvLiV9KJ3joVdtTwQii8lnPxgnmcoskMnMdTPVg 6vcCTE650q0U0Sk2EMxwbgvtjLKvBMyf6MFCId/N7QiAzVynSP7kVvUDLwPOTx19Exy1 akiQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1759742410; x=1760347210; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=MMj6HUs9on7IIxm1BPTxwxkgCj3ewrW1rHWYxIGDWcc=; b=RGgoHoY8KVofF3bH+cT/MooEWjSJDZmQhi5pqoNYxxHBiKzutRJlOpmAxFMABunKvp ECog+BoeW56oBUQTmhOPlD65vEoLDKOiYJL/lAumu5dOtLLPH3x1bzFLemPtoHmgCWbw iMX32e2Yi/SHyZQ4xyDj8dUsyIriO+NBjApcGH221b1QGBTT8CdvCEN0Co72+ToUiU5h 4K6jkoq0IGmtCJz3x8+vLfAePM7Hn00vw8LEBQLXncbUBvKupPfpPPdgrffSfKpmhqGa kdjFKy6QGNGG6GB5adSllprROdhGLtf7Dcou8Ss/zvQj6QJR9vGluJygDjyZFme04xHz ChAQ== X-Forwarded-Encrypted: i=1; AJvYcCUWjHq3f1jQ4Ma4pHb1Wmh232/CWL/gnXjVikL/czBV8v3IGxZpI4TQo9YMG7H/PCIsd2JcgX/2QmVCdvZi@lists.postgresql.org X-Gm-Message-State: AOJu0YxSZwdsI7CS+WCBTBtU5H2wSQLAY9sIyY2IFf8zHEMkAow3MrVp PQvW4qMFSXwYDAiYppABXajt0m+ao+CLSZ96d/5ctmQW0g88W2ItaRlG9StVdJrWCG0yZ9LHojL HjoFJdAiJZLOyDXUACI3XwLQzUKgqIwJHBpkFQCxf X-Gm-Gg: ASbGncu+6pEJNLeqzjvmvMA7BlwPTw2KEhRC23R8LTN+lPlVAt4RtfBQlJkb/56Hbx2 sbPQfKlvQiaiq++rAiGYp30216yFIg6r/mMmDFYxR/ejm2eIM0WI7KbS3SqNUoWawkO458UCQnF voK4Z1uunvpX3fbp7D0piNhY/1i7dDJSV+VCp6JJp2v2Gsjxq6EZHhS+zQfskiQ2MEoblHXiwGC 9wDWjs4Q/8RwJE4DklQcKnvH8YquI55 X-Google-Smtp-Source: AGHT+IEsbPRWgSW299o3dZ7rLB/2sVUZSnv1UnCd3lgsor1K6YrFMj/OkOha+b/uUzF/h0f6WftlPxwMYDcOodhJ510= X-Received: by 2002:a2e:a543:0:b0:372:9223:9efb with SMTP id 38308e7fff4ca-374c3899d5fmr31329751fa.45.1759742410496; Mon, 06 Oct 2025 02:20:10 -0700 (PDT) MIME-Version: 1.0 References: <013D63E2-5D75-492E-85FF-1D5CC0148C82@gmail.com> <499686.1759250489@sss.pgh.pa.us> <68dcd1f2.df0a0220.3300c0.f7af@mx.google.com> <68dd1b79.170a0220.3c4175.198f@mx.google.com> In-Reply-To: <68dd1b79.170a0220.3c4175.198f@mx.google.com> From: Jakub Wartak Date: Mon, 6 Oct 2025 11:19:41 +0200 X-Gm-Features: AS18NWCPABTu3Q-8KEh6muZvKl_iPC3jeNbN7TFdsrAXHH8oJqTTezzhbzW92mU Message-ID: Subject: Re: The ability of postgres to determine loss of files of the main fork To: Michael Banck Cc: Aleksander Alekseev , pgsql-hackers@lists.postgresql.org, Tom Lane , Frits Hoogland Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Wed, Oct 1, 2025 at 2:15=E2=80=AFPM Michael Banck wrote= : > > > That might be a lot of files to open. > > > > I was afraid of that, but let's say modern high-end is 200TB big DB, > > that's like 200*1024 1GB files, but I'm getting such time(1) timings > > for 204k files on ext4: > > > > $ time ./createfiles # real 0m2.157s, it's > > open(O_CREAT)+close() > > $ time ls -l many_files_dir/ > /dev/null # real 0m0.734s > > $ time ./openfiles # real 0m0.297s , for > > already existing ones (hot) > > $ time ./openfiles # real 0m1.456s , for > > already existing ones (cold, echo 3 > drop_caches sysctl) > > > > Not bad in my book as a one time activity. It could pose a problem > > potentially with some high latency open() calls, maybe NFS or > > something remote I guess. > > Yeah, did you try on SAN as well? I am doubtful that will be performant. OK you I guess you wanted to have some latency + big fs size, so I do not have real SAN here, but I've tried that on noisy Azure's Cloud 4x disk Standard __HDD__ (probably SSD/flash but just throttled way down). Each LUN of size: 32TB (max), giving me in total 128TB striped fs: - lvcreate -L 128G -n lv -i 4 -I 1M vgbig + mkfs.xfs - ioping reports there the following latencies for Direct IO: min/avg/max/mdev =3D 3.25 ms / 18.0 ms / 193.8 ms / 39.5 ms (yay!) - so as per above it's doesn't have low latency at all - simple fio (fio --name=3Drand_read_iops --filename=3D/xfs/many_files_dir/test1 --rw=3Drandread --bs=3D4k --direct= =3D1 --ioengine=3Dio_uring --iodepth=3D64 --size=3D4G --runtime=3D60 --time_base= d --group_reporting) reports just: IOPS=3D959, BW=3D3837KiB/s Yet, to create/open 204k empty files to simulate that 200TB cluster on that 128TB fs: $ time ./createfiles # real 0m2.140s $ time ls -l /xfs/many_files_dir/ > /dev/null # real 0m0.697s $ time ./openfiles # real 0m0.440s (hot) $ time ./openfiles # real 0m29.872s (after 3 to drop_caches) $ time ./openfiles # real 0m0.443s (hot) $ time ./openfiles # real 0m31.050s (after 3 to drop_caches) $ echo 3 > /proc/sys/vm/drop_caches $ time ls -l /xfs/many_files_dir/ > /dev/null # real 1m2.345s $ time ./openfiles # real 0m0.437s (hot) Anyway, I do not know if opening all the files on startup (or just crash-recovery?) is the proper way, personally I would take +30s to open a 200TB database any day over the risk of silent data loss and wrong results given the option. I'm sure however that ignoring missing files on access is certainly not how things should be looking like and I would speculate that you might even get some accidents that files may go missing when up & running and you won't be notified in any way (due to human error: imagine someone rsyncing the wrong [opposite] direction he wanted by accident or something like that). -J.