Date: Fri, 20 Sep 2019 15:03:32 -0700
From: Andres Freund
To: Tom Lane
Cc: Andrew Dunstan, Kuntal Ghosh, Michael Paquier, Tomas Vondra, Robert Haas, Thomas Munro, PostgreSQL Hackers
Subject: Re: subscriptionCheck failures on nightjar
Message-ID: <20190920220332.qhd4ym26wa76ajqt@alap3.anarazel.de>
In-Reply-To: <2636.1569016167@sss.pgh.pa.us>

Hi,

On 2019-09-20 17:49:27 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
> >> I recreated my freebsd-9-under-qemu setup and I can still reproduce
> >> the problem, though not with high reliability (order of 1 time in 10).
> >> Anything particular you want logged?
>
> > A DEBUG2 log would help a fair bit, because it'd log some information
> > about what changes the "horizons" determining when data may be removed.
>
> Actually, what I did was as attached [1], and I am getting traces like
> [2]. The problem seems to occur only when there are two or three
> processes concurrently creating the same snapshot file. It's not
> obvious from the debug trace, but the snapshot file *does* exist
> after the music stops.
>
> It is very hard to look at this trace and conclude anything other
> than "rename(2) is broken, it's not atomic". Nothing in our code
> has deleted the file: no checkpoint has started, nor do we see
> the DEBUG1 output that CheckPointSnapBuild ought to produce.
> But fsync_fname momentarily can't see it (and then later another
> process does see it).

Yikes. No wonder most of us weren't able to reproduce the problem, and
that staring at our code didn't point to a bug.

Nice catch.

> In short, what we got here is OS bugs that have probably been
> resolved years ago.
>
> The question is what to do next. Should we just retire these
> specific buildfarm critters, or do we want to push ahead with
> getting rid of the PANIC here?

Hm. Given that the fsync failing is actually an issue, I'm somewhat
disinclined to remove the PANIC. It's not like only raising an ERROR
would actually solve anything, except making the problem even harder to
diagnose. Nor are we otherwise OK with renames not being atomic.
So I'd be tentatively in favor of either upgrading, replacing the
filesystem (perhaps ZFS isn't buggy in the same way?), or retiring those
animals.

Greetings,

Andres Freund
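
The guarantee at stake in this thread is that rename(2) replaces its target atomically, so a process that reopens the final path right after its own rename should never get ENOENT, even when other processes rename over the same path at the same moment. The following is a minimal sketch of that pattern, not PostgreSQL's actual code: the function names, the temp-file naming scheme, and the use of threads in place of backend processes are all assumptions made for illustration, and a POSIX-like platform is assumed.

```python
import os
import tempfile
import threading


def publish_atomically(directory, final_name, payload, worker_id):
    """Write payload to a private temp file, rename it over final_name,
    then reopen final_name and fsync it.

    Returns None on success, or the OSError raised when the file could
    not be opened under its final name after the rename.
    """
    final_path = os.path.join(directory, final_name)
    # Hypothetical naming scheme: one private temp file per worker.
    tmp_path = f"{final_path}.{worker_id}.tmp"

    # 1. Write the private temp file and flush it to stable storage.
    #    (The payload is small, so a single os.write is assumed to suffice.)
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)
    finally:
        os.close(fd)

    # 2. Move it into place. With an atomic rename, final_path refers to
    #    either the old file or the new one at every instant.
    os.replace(tmp_path, final_path)

    # 3. Reopen by the final name and fsync. This is the step that would
    #    fail with ENOENT if the rename were not atomic.
    try:
        fd = os.open(final_path, os.O_RDWR)
    except OSError as exc:
        return exc
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return None


def run_race(workers=3, rounds=50):
    """Let several workers publish the same file concurrently and collect
    every case in which the file was not visible after the rename."""
    failures = []
    with tempfile.TemporaryDirectory() as directory:
        for round_no in range(rounds):
            name = f"{round_no}.snap"

            def work(worker_id, name=name):
                err = publish_atomically(directory, name, b"snapshot", worker_id)
                if err is not None:
                    failures.append((name, worker_id, err))

            threads = [threading.Thread(target=work, args=(i,))
                       for i in range(workers)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()

            # The file must also exist once every worker has finished.
            if not os.path.exists(os.path.join(directory, name)):
                failures.append((name, None, "missing after all workers finished"))
    return failures
```

On a filesystem whose rename is atomic, `run_race()` should return an empty list. A failure recorded in step 3 while the file is present at the end of the round would match the symptom described in the quoted trace. Because the sketch uses threads and a temporary directory, it only illustrates the expectation and is not a reproduction of the buildfarm failure.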