Subject: Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From: Simon Riggs
To: Fujii Masao
Cc: Heikki Linnakangas, Aidan Van Dyk, PostgreSQL-development
Date: Thu, 25 Mar 2010 08:08:11 +0000
Message-Id: <1269504491.8481.8965.camel@ebony>
In-Reply-To: <3f0b79eb1003241908n1e8f38e0q7cd7465163b3d7af@mail.gmail.com>

On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
> On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs wrote:
> > PANICing won't change the situation, so it just destroys server
> > availability. If we had 1 master and 42 slaves then this behaviour would
> > take down almost the whole server farm at once. Very uncool.
> >
> > You might have reason to prevent the server starting up at that point,
> > when in standby mode, but that is not a reason to PANIC. We don't really
> > want all of the standbys thinking they can be the master all at once
> > either. Better to throw a serious ERROR and have the server still up and
> > available for reads.
>
> OK. How about making the startup process emit WARNING, stop WAL replay and
> wait for the presence of trigger file, when an invalid record is found?
> Which keeps the server up for readonly queries. And if the trigger file is
> found, I think that the startup process should emit a FATAL, i.e., the
> server should exit immediately, to prevent the server from becoming the
> primary in a half-finished state. Also to allow such a halfway failover,
> we should provide fast failover mode as pg_standby does?

The lack of docs begins to show a lack of coherent high-level design
here. By now, I've forgotten what this thread was even about. The major
design decision in this that keeps showing up is "remove pg_standby, at
all costs" but no reason has ever been given for that. I do believe
there is a "better way", but we won't find it by trial and error, even
if we had time to do so.

Please work on some clear docs for the failure modes in this system.
That way we can all read them and understand them, or point out further
issues. Moving straight to code is not a solution to this, since what we
need now is to all agree on the way forwards. If we ignore this, then
there is considerable risk that streaming rep will have a fatal
operational flaw.

Please just document/diagram how it works now, highlighting the problems
that still remain to be solved. We're all behind you and I'm helping
wherever I can.

-- 
Simon Riggs           www.2ndQuadrant.com