From: Tom Lane
To: pgsql-hackers@lists.postgresql.org
Subject: subscriptionCheck failures on nightjar
Date: Mon, 11 Feb 2019 01:31:23 -0500
Message-ID: <17827.1549866683@sss.pgh.pa.us>

nightjar just did this:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2019-02-11%2004%3A33%3A07

The critical bit seems to be that the publisher side of the
010_truncate.pl test failed like so:

2019-02-10 23:55:58.765 EST [40771] sub3 LOG: statement: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ
2019-02-10 23:55:58.765 EST [40771] sub3 LOG: received replication command: CREATE_REPLICATION_SLOT "sub3_16414_sync_16394" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2019-02-10 23:55:58.798 EST [40728] sub1 PANIC: could not open file "pg_logical/snapshots/0-160B578.snap": No such file or directory
2019-02-10 23:55:58.800 EST [40771] sub3 LOG: logical decoding found consistent point at 0/160B578
2019-02-10 23:55:58.800 EST [40771] sub3 DETAIL: There are no running transactions.

I'm not sure what to make of that, but I notice that nightjar has
failed subscriptionCheck seven times since mid-December, and every one
of those shows this same PANIC. Meanwhile, no other buildfarm member
has produced such a failure. It smells like a race condition with
a rather tight window, but that's just a guess.

So: (1) what's causing the failure? (2) could we respond with
something less than take-down-the-whole-database when a failure
happens in this area?

regards, tom lane