public inbox for [email protected]  
help / color / mirror / Atom feed
From: vignesh C <[email protected]>
To: PostgreSQL Hackers <[email protected]>
Subject: Random pg_upgrade 004_subscription test failure on drongo
Date: Thu, 13 Mar 2025 14:34:15 +0530
Message-ID: <CALDaNm3tjY44HoSwY84=XGEbTg0ruVfD4hAMTm=TgBqVysH4Qw@mail.gmail.com> (raw)

Hi,

In one of the recent buildfarm runs, the test
pg_upgrade/004_subscription test fails at [1].
It has failed with the following:
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
for
the probable cause of the failure.
Failure, exiting

Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.
I checked old buildfarm failures and found a similar failure in drongo
for pg_upgarde/003_logical_slots at [2].

The test failure can happen because of the reason mentioned at [3] and [6].
I'm just requoting the contents from there, if you have checked [3]
and [6] skip the below contents:
---------------------------------------------
The failure reason for the earlier failure was found to be:
According to it, the TRUNCATE command seemed to be failed due to a
"File exists" error.
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:
could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...

## Analysis
I think it was caused due to the STATUS_DELETE_PENDING failure, not
related with recent
updates for pg_upgrade.

The file "base/1/2683" is an index file for
pg_largeobject_loid_pn_index, and the
output meant that file creation failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the
file which have the same name. Below is a trace until unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found Thomas said that [4] pgunlink sometimes could not remove a
file even if it returns OK, at that time NTSTATUS is
STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
mentions the same thing:

```
/*
* ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
* gone (Windows NT status code is STATUS_DELETE_PENDING).  In that
* case, we'd better ask for the NT status too so we can translate it
* to a more Unix-like error.  We hope that nothing clobbers the NT
* status in between the internal NtCreateFile() call and CreateFile()
* returning.
*
```

The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
that, indeed, open() would be able to fail with STATUS_DELETE_PENDING
if the deletion is pending but it is trying to open.
---------------------------------------------

This was fixed by the following change in the target upgrade nodes:
bgwriter_lru_maxpages = 0
checkpoint_timeout = 1h

Attached is a patch in similar lines for 004_subscription.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
[2] - https://www.postgresql.org/message-id/flat/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A%40TYAPR01MB5866.jp...
[3] - https://www.postgresql.org/message-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA%40TYAPR01MB5866.jpnprd0...
[4] - https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail...
[5] - https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60...
[6]- https://www.postgresql.org/message-id/f0d303f1-e380-5988-91c7-74b755abd4bb%40gmail.com

Regards,
Vignesh


Attachments:

  [application/octet-stream] 0001-Fix-an-intermetant-BF-failure-in-004_subscription.patch (1.6K, 2-0001-Fix-an-intermetant-BF-failure-in-004_subscription.patch)
  download | inline diff:
From c7865aed84942e76d88fc08fdbce74e2b001cec2 Mon Sep 17 00:00:00 2001
From: Vignesh <[email protected]>
Date: Thu, 13 Mar 2025 14:01:03 +0530
Subject: [PATCH] Fix an intermetant BF failure in 004_subscription.

During upgrade, when pg_restore performs CREATE DATABASE, bgwriter or
checkpointer may flush buffers and hold a file handle for pg_largeobject,
so later TRUNCATE pg_largeobject command will fail if OS (such as older
Windows versions) doesn't remove an unlinked file completely till it's
open.
---
 src/bin/pg_upgrade/t/004_subscription.pl | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/src/bin/pg_upgrade/t/004_subscription.pl b/src/bin/pg_upgrade/t/004_subscription.pl
index e3ccff9f270..9510e32a7e3 100644
--- a/src/bin/pg_upgrade/t/004_subscription.pl
+++ b/src/bin/pg_upgrade/t/004_subscription.pl
@@ -29,6 +29,19 @@ my $oldbindir = $old_sub->config_data('--bindir');
 # Initialize the new subscriber
 my $new_sub = PostgreSQL::Test::Cluster->new('new_sub');
 $new_sub->init;
+
+# During upgrade, when pg_restore performs CREATE DATABASE, bgwriter or
+# checkpointer may flush buffers and hold a file handle for the system table.
+# So, if later due to some reason we need to re-create the file with the same
+# name like a TRUNCATE command on the same table, then the command will fail
+# if OS (such as older Windows versions) doesn't remove an unlinked file
+# completely till it is open.
+$new_sub->append_conf(
+	'postgresql.conf', q{
+bgwriter_lru_maxpages = 0
+checkpoint_timeout = 1h
+});
+
 my $newbindir = $new_sub->config_data('--bindir');
 
 # In a VPATH build, we'll be started in the source directory, but we want
-- 
2.43.0



view thread (9+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Random pg_upgrade 004_subscription test failure on drongo
  In-Reply-To: <CALDaNm3tjY44HoSwY84=XGEbTg0ruVfD4hAMTm=TgBqVysH4Qw@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox