From: Tom Lane <tgl@sss.pgh.pa.us>
To: Andres Freund <andres@anarazel.de>
cc: Thomas Munro <thomas.munro@enterprisedb.com>,
        PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: subscriptionCheck failures on nightjar
In-reply-to: <20190213171101.6wpz7tardp3t3uvk@alap3.anarazel.de>
References: <17827.1549866683@sss.pgh.pa.us>
 <CAEepm=1pbie9C_PtojGum7qXAAU1hB8JtA6v_9dQFPgay3PcZg@mail.gmail.com>
 <27965.1550077052@sss.pgh.pa.us>
 <20190213171101.6wpz7tardp3t3uvk@alap3.anarazel.de>
Comments: In-reply-to Andres Freund <andres@anarazel.de>
	message dated "Wed, 13 Feb 2019 09:11:01 -0800"
Date: Wed, 13 Feb 2019 12:37:35 -0500
Message-ID: <29708.1550079455@sss.pgh.pa.us>

Andres Freund <andres@anarazel.de> writes:
> On 2019-02-13 11:57:32 -0500, Tom Lane wrote:
>> I've managed to reproduce this locally, and obtained this PANIC:

> Cool. How exactly?

Andrew told me that nightjar is actually running in a qemu VM,
so I set up FreeBSD 9.0 in a qemu VM, and boom.  It took a bit
of fiddling with qemu parameters, but for such a timing-sensitive
problem, that's not surprising.

>> Anyway, I think we might be able to fix this along the lines of
>> [ fsync the data before renaming not after ]

> Hm, but that's not the same? On some filesystems one needs the directory
> fsync, on some the file fsync, and I think both in some cases.

Now that I look at it, there's a pg_fsync() just above this, so
I wonder why we need a second fsync on the file at all.  fsync'ing
the directory is needed to ensure the directory entry is on disk;
but the file data should be out already, or else the kernel is
simply failing to honor fsync.
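
To make the ordering concrete, here is a minimal sketch of the sequence
I have in mind: write under a temporary name, fsync the file, rename,
then fsync the directory.  This is not the actual snapshot-writing code;
durable_write_rename() and its parameters are hypothetical names made up
for illustration, it uses plain POSIX calls rather than pg_fsync(), and
it deliberately leaves out the extra pre-rename directory fsync that
Andres mentions for some FS/OS combinations.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): durably install "data" as
 * dir/final_name, going through dir/tmp_name.
 * Returns 0 on success, -1 on any failure.
 */
int
durable_write_rename(const char *dir, const char *tmp_name,
                     const char *final_name,
                     const void *data, size_t len)
{
    char        tmp_path[1024];
    char        final_path[1024];
    int         fd;
    int         dirfd;

    snprintf(tmp_path, sizeof(tmp_path), "%s/%s", dir, tmp_name);
    snprintf(final_path, sizeof(final_path), "%s/%s", dir, final_name);

    /* 1. Write the data under the temporary name. */
    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* For simplicity, a short write is treated as a failure. */
    if (write(fd, data, len) != (ssize_t) len)
    {
        close(fd);
        unlink(tmp_path);
        return -1;
    }

    /* 2. fsync the file, so its contents are on disk before the rename. */
    if (fsync(fd) != 0)
    {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmp_path);
        return -1;
    }

    /* 3. Rename into place; the valid name now refers to synced data. */
    if (rename(tmp_path, final_path) != 0)
    {
        unlink(tmp_path);
        return -1;
    }

    /* 4. fsync the directory, so the new directory entry is on disk too. */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) != 0)
    {
        close(dirfd);
        return -1;
    }
    return close(dirfd);
}
```

With this ordering there is no file fsync after the rename: step 2 has
already forced the data out, and step 4 only has to make the directory
entry durable.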

>> The existing code here seems simply wacky/unsafe to me regardless
>> of this race condition: couldn't it potentially result in a corrupt
>> snapshot file appearing with a valid name, if the system crashes
>> after persisting the rename but before it's pushed the data out?

> What do you mean precisely with "before it's pushed the data out"?

Given the pg_fsync() that precedes the rename, this isn't an issue:
the file data is already on disk before the new name can appear.

>> I also wonder why bother with the directory sync just before the
>> rename.

> Because on some FS/OS combinations the size of the renamed-into-place
> file isn't guaranteed to be durable unless the directory was
> fsynced.

Bleah.  But in any case, the rename should not create a situation
in which we need to fsync the file data again.

			regards, tom lane