Message-ID: <4B726120.80007@enterprisedb.com>
Date: Wed, 10 Feb 2010 09:32:48 +0200
From: Heikki Linnakangas
Organization: EnterpriseDB
To: Fujii Masao
CC: PostgreSQL-development
Subject: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
References: <20100127152751.3B2047541B9@cvs.postgresql.org> <3f0b79eb1002092105r21e009d3v468496058ba04392@mail.gmail.com>
In-Reply-To: <3f0b79eb1002092105r21e009d3v468496058ba04392@mail.gmail.com>

Fujii Masao wrote:
> As I pointed out previously, the standby might restore a partially-filled
> WAL file that is being archived by the primary, and cause a FATAL error.
> And this happened in my box when I was testing the SR.
>
> sby [20088] FATAL: archive file "000000010000000000000087" has
> wrong size: 14139392 instead of 16777216
> sby [20076] LOG: startup process (PID 20088) exited with exit code 1
> sby [20076] LOG: terminating any other active server processes
> act [18164] LOG: received immediate shutdown request
>
> If the startup process is in standby mode, I think that it should retry
> starting replication instead of emitting an error when it finds a
> partially-filled file in the archive. Then if the replication has been
> terminated, it has only to restore the archived file again. Thought?

Hmm, so after running restore_command, check the file size and if it's too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH it will then fail to complain about a WAL file that's genuinely truncated for some reason. I guess there's no way around that: even if you have a script as restore_command that does the file size check, it will have the same problem.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
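
To make the proposal concrete, here is a rough sketch of the check discussed above, written as standalone Python rather than as the actual startup-process code. The names try_restore and restore_with_retry, the restore callable (standing in for running restore_command and returning its exit code), and the bounded attempts parameter are all assumptions made for illustration. The only number taken from the thread is the expected segment size, 16777216 bytes.

```python
import os
import time

# Expected WAL segment size, as reported in the FATAL message quoted above.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # 16777216 bytes


def try_restore(restore, dest_path, expected_size=WAL_SEGMENT_SIZE):
    """Run one restore attempt; return True only if a full-size file arrived.

    `restore` is a hypothetical callable standing in for restore_command:
    it copies the archived file to dest_path and returns an exit code.
    """
    if restore(dest_path) != 0:
        return False  # restore_command itself reported failure
    try:
        actual_size = os.path.getsize(dest_path)
    except OSError:
        return False  # nothing was written to dest_path
    if actual_size != expected_size:
        # Wrong size: treat it like a non-zero exit and discard the partial
        # copy so that the next attempt starts clean.
        os.remove(dest_path)
        return False
    return True


def restore_with_retry(restore, dest_path, attempts, delay=0.0,
                       expected_size=WAL_SEGMENT_SIZE):
    """Retry try_restore up to `attempts` times, sleeping `delay` seconds
    after each failed attempt. Returns True on the first success."""
    for _ in range(attempts):
        if try_restore(restore, dest_path, expected_size):
            return True
        time.sleep(delay)
    return False
```

Note that the sketch shares the limitation mentioned above: a file that is genuinely truncated in the archive looks exactly like one that is still being written, so the caller only ever sees repeated failures and never a specific complaint about the file.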