Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v3vjw-005kuH-0A for pgsql-hackers@arkaria.postgresql.org; Wed, 01 Oct 2025 12:15:24 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1v3vjt-001crB-Ru for pgsql-hackers@arkaria.postgresql.org; Wed, 01 Oct 2025 12:15:22 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v3vjt-001cr3-GQ for pgsql-hackers@lists.postgresql.org; Wed, 01 Oct 2025 12:15:22 +0000 Received: from mail-ej1-x62f.google.com ([2a00:1450:4864:20::62f]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1v3vjq-0014sJ-2k for pgsql-hackers@lists.postgresql.org; Wed, 01 Oct 2025 12:15:21 +0000 Received: by mail-ej1-x62f.google.com with SMTP id a640c23a62f3a-afcb78ead12so1276727766b.1 for ; Wed, 01 Oct 2025 05:15:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1759320917; x=1759925717; darn=lists.postgresql.org; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:from:to:cc:subject:date:message-id:reply-to; bh=juQRWwesQTcF3w52MxyD+cZCUBev7RA6oGvOdmo3Sdo=; b=J9+Qw5egwyy1JSRHmYmPVgpw3O8X2HK5siPA8n+BQaRo6yDyi2ajlyO50KX2NfNKtu 024lD+QLtdMFEvodp0txS0IegVHFEnzoBX2eaAqeiELECE+mYaiGGfin0TzOi6s1xIA5 M27Bz8ASPXKisblMlz01UHqZ0aPjP0Lo+cge0NcQkDRMJo0ifH3yjJZoS6FDYoPlCk33 sk4dkUL5sh6rNFX+286T3Eeu+MM4/AA2YLw45QFIdFv69NIgCzx6DvrJn9YAhxoRDcHK ItSVauNCyemLa3LrqmUQl7fzgeOwaHn+WF65L7eFKhwVvINKPjtIDpRrqKFqf684KGTk P9bA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1759320917; x=1759925717; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=juQRWwesQTcF3w52MxyD+cZCUBev7RA6oGvOdmo3Sdo=; b=vkFQa49qDI1RwLEgwaYKhcr4q00PbtRSrteCHA7iX8J5pptiErxDIFM+0PWpuMs5VR RvJzAicMO1sjVWXIi78D1Q+4rclbVo+I1fai+V9oF6Njgf9WlRTIkm/Om2uuiagdN0dz RZxkpXQLkO+6MMbF9Ss+OZ59wgspri3L6Tm/73EL5kR1pRUQ2gsECqRj5C69Er4EQ7G8 EBe/h2YZCEDSwZ3aovfrHTgWRIiLH+0ZI3xSiVd9HD8xh4YUFN+vi1zGslQcCSbYLS5F /VERVSoDg75q8Zn8JZFs7w+laxpS+Dmq/wveEYqzb/ZbDCCBU+N8Bfm9A7XrNRHedIFe P3wQ== X-Forwarded-Encrypted: i=1; AJvYcCUaiW/eSsdG6EoAH0W2pODrkALmenThGhMLIpaSss13HGeFBHbXwAfOT6vVi+06ylT5xoLPu9Wfnyj6DFmU@lists.postgresql.org X-Gm-Message-State: AOJu0YyXNE8zUjbKbc63dKyY6IlQZRBFkFKLNwYqid8wmwYPV9kxzdoT jUnDPBv4OBIa2CZO4vp2GWXNbrp6jffWWdxKNek008IGK9aQB8/09X2k X-Gm-Gg: ASbGncsNxYMrhNVRsibs6p3czqzc5sjdEv2kieatMu63XQLTOk/bbXx7HEDAqqutUec qTTJYsEIPKJOYJTUZcl54kqaeCUidLmVRrKKdvM/HExWwCheZyxT+8PZEJMECaBIgGTUBZJg0bb RabZc6oc8uVtlo4zTCO2Ce2qbINL53apXP7KTWwWwL69v1Vl9k07Vz4R92d01RDWuo12v4UUmgS Lho+mDca8dcSxXYFj5x+HXb0jT0dcQrUTCDYNJsNNpsEplUJj3VNCayiSTntaF6EAgdcMDVfJ5h /klsBIWRb+64Gz3/xUetLFEVhjaVsRcnsTtXNaYAdBRcLn/52tCWE1ZFBu55lsLlZ5wjRo6vf/U I/pL81t7c1XNNlzynNaNFoBIqpr1Cnapp7DdL+wiZeZAYijwszJXcsysxfNlyT6mS8l3Uk0LgKz 1k86HtdBmVoiF37bXnuWws1TtjfTv0GOLX8ovTyhrwN+jlkk+NNSeYU55oE2yulE5VpBV4keR90 cVQYCLMihQu5x0= X-Google-Smtp-Source: AGHT+IG1gtFoWcAK9U9KGEVK0eBAogc9apmSEdtrymABrB0e/GpM9dJ2YEHmUjsOBgMwp85GTTNo0g== X-Received: by 2002:a17:907:6092:b0:b3e:9818:30e0 with SMTP id a640c23a62f3a-b46e631d038mr395207866b.29.1759320917256; Wed, 01 Oct 2025 05:15:17 -0700 (PDT) Received: from smtpclient.apple (2001-1c04-0681-7700-f0c6-582f-3a73-786e.cable.dynamic.v6.ziggo.nl. [2001:1c04:681:7700:f0c6:582f:3a73:786e]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-b3c62d44b3asm793425466b.72.2025.10.01.05.15.16 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Wed, 01 Oct 2025 05:15:16 -0700 (PDT) From: Frits Hoogland Message-Id: Content-Type: multipart/alternative; boundary="Apple-Mail=_D05460DC-2BDB-446E-A03D-C60D2A8A3584" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.100.1.1.5\)) Subject: Re: The ability of postgres to determine loss of files of the main fork Date: Wed, 1 Oct 2025 14:15:06 +0200 In-Reply-To: Cc: Michael Banck , Tom Lane , Aleksander Alekseev , pgsql-hackers@lists.postgresql.org To: Jakub Wartak References: <013D63E2-5D75-492E-85FF-1D5CC0148C82@gmail.com> <499686.1759250489@sss.pgh.pa.us> <68dcd1f2.df0a0220.3300c0.f7af@mx.google.com> X-Mailer: Apple Mail (2.3864.100.1.1.5) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_D05460DC-2BDB-446E-A03D-C60D2A8A3584 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 Thank you for looking into this Jakub, Aleksander, Michael, Tom, There are more scenario's where files might be missing: how about a = backup somehow missing files? But also deliberate sabotage, you can hide data in this way, and it = would be hard to detect, and even harder to understand what happened, why it happened and how it happened. = I dare to say quite ideal if you want a name to be removed from a database a covert way, and know = something about the database and have access, or somehow can manipulate in the data = directory. Without having the exact sourcecode files and functions handy, the way I = think using the segments works in postgres, is that the database, because of the absence of any = indicator of size, will scan=20 the relfilenode, and when it reaches the set limit (1GB), it will try to = open the next segment. If that=20 segment doesn't exist, the database assumes it's the end of the segment. = That is all. That is why there is no error: it's exactly alike when there truly is no more data, and = there is no metadata to understand there is supposed to be more data. And therefore I talked about the primary key: that contains "indirect" = data to be able to detect if a table has a missing segment. But I think really what is needed is a max = page number to indicate the known size. Frits Hoogland > On 1 Oct 2025, at 13:20, Jakub Wartak = wrote: >=20 > On Wed, Oct 1, 2025 at 9:02=E2=80=AFAM Michael Banck = wrote: >>=20 >> Hi, >>=20 >> wow, this is one of the most terrifying threads I've ever seen... >=20 > Same. >=20 >> On Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote: >>> Aleksander Alekseev writes: >>>>> Therefore, I would like to request an enhancement: add an option = to >>>>> verify_heapam() that causes the primary key index to be scanned = and makes >>>>> sure that all line pointers in the index point to existing tuples. >>>=20 >>>> ... IMO there is little value in adding a check for the existence = of >>>> the segments for a single table. And the *real* check will not = differ >>>> much from something like SELECT * FROM my_table, or from making a >>>> complete backup of the database. >>>=20 >>> As Frits mentioned, neither of those actions will really notice if a >>> table has been truncated via loss of a segment. >>=20 >> Is there a valid case for a missing segment? If not, couldn't this be >> caught somewhere in the storage manager? >>=20 >=20 > I've took a look on PG17 and in _mfd_openseg() there's if fd < 0 > return NULL after open(), but out of it's callers only _mdfd_getseg() > seems to be alerting on that NULL. To me this seems like a bug, > because i've seen way too many times people and software deleting > files randomly. Even simple crashes (with e2fsck, xfs_repair) could > put orphaned inodes into /lost+found. IMHO all files should be opened > at least on startup to check integrity, because the non-zero return > code (during such SELECT) for openat(2) seems o be coming out of > = RelationGetNumberOfBlocksInFork()->table_block_relation_size()->smgrnblock= s()->mdnblocks()->_mdfd_openseg(). > Now if the 1st seg file would be missing we would complain in > mdopenfork(). mdnblocks() says even "all active segments of the > relation are opened...", but even that apparently is not true. >=20 > The bigger context seems to be be that 049469e7e7cfe0c69 (2015) could > be culprit here as well, as it is stated there that mdnblocks() could > earlier create zero-length files back in day and it removed that > ereport(ERROR) when unable to access that file. >=20 > Another idea (than this being a bug) is that Thomas had a large > relation patchset back in [1], but I wouldn't be a fan of us operating > on 31-32TB files ;) >=20 > -J. >=20 > [1] - = https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf2= 5y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com --Apple-Mail=_D05460DC-2BDB-446E-A03D-C60D2A8A3584 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8 Thank you for looking into this Jakub, = Aleksander, Michael, Tom,

There are more scenario's = where files might be missing: how about a backup somehow missing = files?
But also deliberate sabotage, you can hide data in this = way, and it would be hard to detect, and even
harder to = understand what happened, why it happened and how it happened. I dare to = say quite ideal
if you want a name to be removed from a = database a covert way, and know something about the
database = and have access, or somehow can manipulate in the data = directory.

Without having the exact sourcecode = files and functions handy, the way I think using the = segments
works in postgres, is that the database, because of = the absence of any indicator of size, will scan 
the = relfilenode, and when it reaches the set limit (1GB), it will try to = open the next segment. If that 
segment doesn't exist, = the database assumes it's the end of the segment. That is all. That is = why there
is no error: it's exactly alike when there truly is = no more data, and there is no metadata to understand
there is = supposed to be more data.

And therefore I = talked about the primary key: that contains "indirect" data to be able = to detect if a
table has a missing segment. But I think really = what is needed is a max page number to indicate
the known = size.



Frits = Hoogland




On 1 Oct 2025, at 13:20, Jakub = Wartak <jakub.wartak@enterprisedb.com> wrote:

On Wed, Oct 1, 2025 at = 9:02=E2=80=AFAM Michael Banck <mbanck@gmx.net> = wrote:

Hi,

wow, this is one of = the most terrifying threads I've ever = seen...

Same.

On = Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote:
Aleksander Alekseev <aleksander@tigerdata.com> = writes:
Therefore, = I would like to request an enhancement: add an option = to
verify_heapam() that causes the primary key index to be scanned = and makes
sure that all line pointers in the index point to existing = tuples.

... = IMO there is little value in adding a check for the existence of
the = segments for a single table. And the *real* check will not = differ
much from something like SELECT * FROM my_table, or from = making a
complete backup of the database.

As = Frits mentioned, neither of those actions will really notice if = a
table has been truncated via loss of a = segment.

Is there a valid case for a missing = segment? If not, couldn't this be
caught somewhere in the storage = manager?


I've took a look on PG17 and in = _mfd_openseg() there's if fd < 0
return NULL after open(), but out = of it's callers only _mdfd_getseg()
seems to be alerting on that = NULL. To me this seems like a bug,
because i've seen way too many = times people and software deleting
files randomly. Even simple = crashes (with e2fsck, xfs_repair) could
put orphaned inodes into = /lost+found. IMHO all files should be opened
at least on startup to = check integrity, because the non-zero return
code (during such = SELECT) for openat(2) seems o be coming out = of
RelationGetNumberOfBlocksInFork()->table_block_relation_size()-&g= t;smgrnblocks()->mdnblocks()->_mdfd_openseg().
Now if the 1st = seg file would be missing we would complain in
mdopenfork(). = mdnblocks() says even "all active segments of the
relation are = opened...", but even that apparently is not true.

The bigger = context seems to be be that 049469e7e7cfe0c69 (2015) could
be culprit = here as well, as it is stated there that mdnblocks() could
earlier = create zero-length files back in day and it removed = that
ereport(ERROR) when unable to access that file.

Another = idea (than this being a bug) is that Thomas had a large
relation = patchset back in [1], but I wouldn't be a fan of us operating
on = 31-32TB files ;)

-J.

[1] - = https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf2= 5y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com

= --Apple-Mail=_D05460DC-2BDB-446E-A03D-C60D2A8A3584--