From: Tom Lane <[email protected]>
To: Andres Freund <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Kuntal Ghosh <[email protected]>
Cc: Michael Paquier <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: subscriptionCheck failures on nightjar
Date: Fri, 20 Sep 2019 17:49:27 -0400
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <CA+TgmoaNOMG9+Ho9d3CX+-10O7+nqqvmSpXb1m0F3dqWB4C-8g@mail.gmail.com>
<[email protected]>
<20190917194510.iqwyl3be62pz7l27@development>
<[email protected]>
<CAGz5QCJv5JbRDsATDTkJqq7h9F7u0QLnNnLHfxR1nEOa4DnkJQ@mail.gmail.com>
<20190918215808.yonxqgycme6pbctp@development>
<[email protected]>
<CAGz5QC+5_mPFoDj7ZSMV0gwvMY+kdOp4t1w=TTDpzuV9F2-X6g@mail.gmail.com>
<[email protected]>
<[email protected]>
<[email protected]>

Andres Freund <[email protected]> writes:
> On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
>> I recreated my freebsd-9-under-qemu setup and I can still reproduce
>> the problem, though not with high reliability (order of 1 time in 10).
>> Anything particular you want logged?
> A DEBUG2 log would help a fair bit, because it'd log some information
> about what changes the "horizons" determining when data may be removed.

Actually, what I did was as attached [1], and I am getting traces like
[2]. The problem seems to occur only when there are two or three
processes concurrently creating the same snapshot file. It's not
obvious from the debug trace, but the snapshot file *does* exist
after the music stops.

It is very hard to look at this trace and conclude anything other
than "rename(2) is broken, it's not atomic". Nothing in our code
has deleted the file: no checkpoint has started, nor do we see
the DEBUG1 output that CheckPointSnapBuild ought to produce.
But fsync_fname momentarily can't see it (and then later another
process does see it).
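
[Editorial illustration, not part of the original message.] The pattern under discussion can be sketched roughly as follows: several writers each write a private temporary file, rename it onto the same final name, and then reopen the final name to fsync it. This is a minimal Python sketch under stated assumptions, not the PostgreSQL code; the names write_snapshot and run_concurrent_writers, the file naming, and the use of threads instead of separate backend processes are all hypothetical. If rename(2) behaves atomically, as POSIX requires, the reopen step should never fail with "file not found", because nothing deletes the target.

```python
import os
import tempfile
import threading


def write_snapshot(directory, name, payload, worker_id):
    """Write payload to a private temp file, rename it over `name`, then fsync the result.

    Hypothetical helper: it loosely mirrors the write/rename/fsync sequence
    described in the message, not the actual PostgreSQL implementation.
    """
    final_path = os.path.join(directory, name)
    tmp_path = f"{final_path}.{worker_id}.tmp"

    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    # os.replace() calls rename(2) on POSIX systems; the target name should
    # always refer to either the old or the new file, never to nothing.
    os.replace(tmp_path, final_path)

    # Reopen the final name and fsync it. This is the step that would raise
    # FileNotFoundError if the rename were momentarily non-atomic.
    fd = os.open(final_path, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def run_concurrent_writers(n_workers=3, rounds=200):
    """Have n_workers threads create the same file concurrently, `rounds` times.

    Returns a list of (round, worker_id, error_text) for every reopen failure.
    """
    failures = []
    lock = threading.Lock()

    def worker(directory, name, round_no, worker_id):
        try:
            write_snapshot(directory, name, b"identical snapshot contents", worker_id)
        except FileNotFoundError as exc:
            with lock:
                failures.append((round_no, worker_id, str(exc)))

    with tempfile.TemporaryDirectory() as directory:
        for round_no in range(rounds):
            name = f"snap-{round_no}"
            threads = [
                threading.Thread(target=worker, args=(directory, name, round_no, i))
                for i in range(n_workers)
            ]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
    return failures


if __name__ == "__main__":
    problems = run_concurrent_writers()
    print(f"{len(problems)} reopen failures")
```

On a system with a conforming rename(2) the list of failures is expected to stay empty; a non-empty list would be the analogue of fsync_fname momentarily not seeing the file. Threads in one process may not exercise the same kernel paths as separate backends, so an empty result here does not rule out the bug on a given platform.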

It is now apparent why we're only seeing this on specific ancient
platforms. I looked around for info about rename(2) not being
atomic, and I found this info about FreeBSD:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=94849
The reported symptom there isn't quite the same, so probably there
is another issue, but there is plenty of reason to be suspicious
that UFS rename(2) is buggy in this release. As for dromedary's
ancient version of macOS, Apple is exceedingly untransparent about
their bugs, but I found
http://www.weirdnet.nl/apple/rename.html

In short, what we've got here is OS bugs that have probably been
resolved years ago.

The question is what to do next. Should we just retire these
specific buildfarm critters, or do we want to push ahead with
getting rid of the PANIC here?

			regards, tom lane