From: Tomas Vondra <[email protected]>
To: Michael Paquier <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Tom Lane <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: subscriptionCheck failures on nightjar
Date: Mon, 26 Aug 2019 15:29:04 +0200
Message-ID: <20190826132904.3ayuw36qzl2c4ktr@development> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CAEepm=0wB7vgztC5sg2nmJ-H3bnrBT5GQfhUzP+Ffq-WT3g8VA@mail.gmail.com>
	<[email protected]>
	<[email protected]>

On Tue, Aug 13, 2019 at 05:04:35PM +0900, Michael Paquier wrote:
>On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:
>> I'm not yet sure that that's actually something that's supposed to
>> happen, I got to spend some time analysing how this actually
>> happens. Normally the contents of the slot should actually prevent it
>> from being removed (as they're newer than
>> ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
>> a bug in the drop logic in newer releases.
>
>In the same context, could it be a consequence of 9915de6c which has
>introduced a conditional variable to control slot operations?  This
>could have exposed more easily a pre-existing race condition.
>--

This is one of the remaining open items, and we don't seem to be moving
forward with it :-(

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parameters or
something. Can you share what exactly was necessary?

An observation about the issue - while we started to notice this after
December, that's mostly because the PANIC patch went in shortly before.
However, we've seen the issue before, as Thomas Munro mentioned in [1].

Those reports are from August, so it's quite possible something in the
first CF upset the code. And there's only a single commit in 2018-07
that seems related to logical decoding / snapshots [2], i.e. f49a80c:

commit f49a80c481f74fa81407dce8e51dea6956cb64f8
Author: Alvaro Herrera <[email protected]>
Date:   Tue Jun 26 16:38:34 2018 -0400

    Fix "base" snapshot handling in logical decoding

    ...

The other reason to suspect this is related is that the fix also made it
to REL_11_STABLE at that time, and if you check the buildfarm data [3],
you'll see that 11 fails on nightjar too, from time to time.

This means it's not a 12+ only issue; it's a live issue on 11 too. I don't
know if f49a80c is the culprit, or if it simply uncovered a pre-existing
bug (e.g. due to timing).


[1] https://www.postgresql.org/message-id/CAEepm%3D0wB7vgztC5sg2nmJ-H3bnrBT5GQfhUzP%2BFfq-WT3g8VA%40mail...

[2] https://commitfest.postgresql.org/18/1650/

[3] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nightjar&br=REL_11_STABLE

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 




