Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1gtxqt-0007sp-AO for pgsql-hackers@arkaria.postgresql.org; Wed, 13 Feb 2019 16:57:39 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.89) (envelope-from ) id 1gtxqr-0001CP-Q9 for pgsql-hackers@arkaria.postgresql.org; Wed, 13 Feb 2019 16:57:37 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1gtxqr-0001By-II for pgsql-hackers@lists.postgresql.org; Wed, 13 Feb 2019 16:57:37 +0000 Received: from sss.pgh.pa.us ([66.207.139.130]) by magus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1gtxqp-0007dR-Hd for pgsql-hackers@lists.postgresql.org; Wed, 13 Feb 2019 16:57:37 +0000 Received: from sss1.sss.pgh.pa.us (localhost [127.0.0.1]) by sss.pgh.pa.us (8.14.4/8.14.4) with ESMTP id x1DGvWSF027966; Wed, 13 Feb 2019 11:57:32 -0500 From: Tom Lane To: Thomas Munro cc: PostgreSQL Hackers Subject: Re: subscriptionCheck failures on nightjar In-reply-to: References: <17827.1549866683@sss.pgh.pa.us> Comments: In-reply-to Thomas Munro message dated "Thu, 14 Feb 2019 00:55:59 +1300" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: <27964.1550077052.1@sss.pgh.pa.us> Content-Transfer-Encoding: quoted-printable Date: Wed, 13 Feb 2019 11:57:32 -0500 Message-ID: <27965.1550077052@sss.pgh.pa.us> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Precedence: bulk Thomas Munro writes: > On Mon, Feb 11, 2019 at 7:31 PM Tom Lane wrote: >> 2019-02-10 23:55:58.798 EST [40728] sub1 PANIC: could not open file "p= g_logical/snapshots/0-160B578.snap": No such file or directory > > They get atomically renamed into place, which seems kosher even if > snapshots for the same LSN are created concurrently by different > backends (and tracing syscalls confirms that that does occasionally > happen). It's hard to believe that nightjar's rename() ceased to be > atomic a couple of months ago. It looks like the only way for files > to get unlinked after that is by CheckPointSnapBuild() deciding they > are too old. > Hmm. Could this be relevant, and cause a well timed checkpoint to > unlink files too soon? > 2019-02-12 21:52:58.304 EST [22922] WARNING: out of logical > replication worker slots I've managed to reproduce this locally, and obtained this PANIC: log/010_truncate_publisher.log:2019-02-13 11:29:04.759 EST [9973] sub3 PAN= IC: could not open file "pg_logical/snapshots/0-16067B0.snap": No such fi= le or directory with this stack trace: #0 0x0000000801ab610c in kill () from /lib/libc.so.7 #1 0x0000000801ab4d3b in abort () from /lib/libc.so.7 #2 0x000000000089202e in errfinish (dummy=3DVariable "dummy" is not avail= able. ) at elog.c:552 #3 0x000000000075d339 in fsync_fname_ext ( fname=3D0x7fffffffba20 "pg_logical/snapshots/0-16067B0.snap", isdir=3D= Variable "isdir" is not available. ) at fd.c:3372 #4 0x0000000000730c75 in SnapBuildSerialize (builder=3D0x80243c118, = lsn=3D23095216) at snapbuild.c:1669 #5 0x0000000000731297 in SnapBuildProcessRunningXacts (builder=3D0x80243c= 118, = lsn=3D23095216, running=3D0x8024424f0) at snapbuild.c:1110 #6 0x0000000000722eac in LogicalDecodingProcessRecord (ctx=3D0x802414118,= = record=3D0x8024143d8) at decode.c:318 #7 0x0000000000736f30 in XLogSendLogical () at walsender.c:2826 #8 0x00000000007389c7 in WalSndLoop (send_data=3D0x736ed0 ) at walsender.c:2184 #9 0x0000000000738c3d in StartLogicalReplication (cmd=3DVariable "cmd" is= not available. ) at walsender.c:1118 #10 0x0000000000739895 in exec_replication_command ( cmd_string=3D0x80240e118 "START_REPLICATION SLOT \"sub3\" LOGICAL 0/0 = (proto_version '1', publication_names '\"pub3\"')") at walsender.c:1536 #11 0x000000000078b272 in PostgresMain (argc=3D-14464, argv=3DVariable "ar= gv" is not available. ) at postgres.c:4252 #12 0x00000000007113fa in PostmasterMain (argc=3D-14256, argv=3D0x7fffffff= cc60) at postmaster.c:4382 So the problem seems to boil down to "somebody removed the snapshot file between the time we renamed it into place and the time we tried to fsync it". I do find messages like the one you mention, but they are on the subscriber not the publisher side, so I'm not sure if this matches the scenario you have in mind? log/010_truncate_subscriber.log:2019-02-13 11:29:02.343 EST [9970] WARNING= : out of logical replication worker slots log/010_truncate_subscriber.log:2019-02-13 11:29:02.344 EST [9970] WARNING= : out of logical replication worker slots log/010_truncate_subscriber.log:2019-02-13 11:29:03.401 EST [9970] WARNING= : out of logical replication worker slots Anyway, I think we might be able to fix this along the lines of CloseTransientFile(fd); + /* ensure snapshot file is down to stable storage */ + fsync_fname(tmppath, false); fsync_fname("pg_logical/snapshots", true); /* * We may overwrite the work from some other backend, but that's ok, o= ur * snapshot is valid as well, we'll just have done some superfluous wo= rk. */ if (rename(tmppath, path) !=3D 0) { ereport(ERROR, (errcode_for_file_access(), errmsg("could not rename file \"%s\" to \"%s\": %m", tmppath, path))); } - /* make sure we persist */ - fsync_fname(path, false); + /* ensure we persist the file rename */ fsync_fname("pg_logical/snapshots", true); The existing code here seems simply wacky/unsafe to me regardless of this race condition: couldn't it potentially result in a corrupt snapshot file appearing with a valid name, if the system crashes after persisting the rename but before it's pushed the data out? I also wonder why bother with the directory sync just before the rename. regards, tom lane