From: Tom Lane
To: Andres Freund
cc: Thomas Munro, PostgreSQL Hackers
Subject: Re: subscriptionCheck failures on nightjar
In-reply-to: <20190213171101.6wpz7tardp3t3uvk@alap3.anarazel.de>
References: <17827.1549866683@sss.pgh.pa.us> <27965.1550077052@sss.pgh.pa.us> <20190213171101.6wpz7tardp3t3uvk@alap3.anarazel.de>
Date: Wed, 13 Feb 2019 12:37:35 -0500
Message-ID: <29708.1550079455@sss.pgh.pa.us>

Andres Freund writes:
> On 2019-02-13 11:57:32 -0500, Tom Lane wrote:
>> I've managed to reproduce this locally, and obtained this PANIC:

> Cool. How exactly?

Andrew told me that nightjar is actually running in a qemu VM,
so I set up freebsd 9.0 in a qemu VM, and boom.  It took a bit
of fiddling with qemu parameters, but for such a timing-sensitive
problem, that's not surprising.

>> Anyway, I think we might be able to fix this along the lines of
>> [ fsync the data before renaming not after ]

> Hm, but that's not the same? On some filesystems one needs the directory
> fsync, on some the file fsync, and I think both in some cases.

Now that I look at it, there's a pg_fsync() just above this, so
I wonder why we need a second fsync on the file at all.  fsync'ing
the directory is needed to ensure the directory entry is on disk;
but the file data should be out already, or else the kernel is
simply failing to honor fsync.

>> The existing code here seems simply wacky/unsafe to me regardless
>> of this race condition: couldn't it potentially result in a corrupt
>> snapshot file appearing with a valid name, if the system crashes
>> after persisting the rename but before it's pushed the data out?

> What do you mean precisely with "before it's pushed the data out"?

Given the previous pg_fsync, this isn't an issue.

>> I also wonder why bother with the directory sync just before the
>> rename.

> Because on some FS/OS combinations the size of the renamed-into-place
> file isn't guaranteed to be durable unless the directory was
> fsynced.

Bleah.  But in any case, the rename should not create a situation
in which we need to fsync the file data again.

			regards, tom lane
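
For reference, the ordering argued for above (fsync the file data first, then rename, then fsync the directory) might look roughly like the sketch below. This is plain POSIX C, not the PostgreSQL code under discussion: durable_write() and fsync_path() are hypothetical names invented for illustration, and the sketch leaves out the extra directory sync before the rename that Andres says some FS/OS combinations need.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open a path read-only and fsync it; used here for the directory. */
static int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Hypothetical helper (not PostgreSQL code): publish `len` bytes of `data`
 * as dir/name so that a crash never leaves a partial file under that name.
 * Returns 0 on success, -1 on any failure.
 */
static int durable_write(const char *dir, const char *name,
                         const void *data, size_t len)
{
    char tmp[1024], final[1024];

    assert(dir != NULL && name != NULL && data != NULL);
    if (snprintf(tmp, sizeof tmp, "%s/%s.tmp", dir, name) >= (int) sizeof tmp)
        return -1;
    if (snprintf(final, sizeof final, "%s/%s", dir, name) >= (int) sizeof final)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* 1. Write and fsync the data while it is still under the temp name.
     *    A short write is simply treated as failure in this sketch. */
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* 2. Atomically give the already-durable file its final name. */
    if (rename(tmp, final) != 0) {
        unlink(tmp);
        return -1;
    }

    /* 3. fsync the directory so the new directory entry reaches disk.
     *    No second fsync of the file data is done after the rename. */
    return fsync_path(dir);
}
```

A caller would use it as durable_write(".", "demo.snap", buf, buflen) and treat a nonzero return as "not safely on disk".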