Date: Tue, 23 Mar 2010 16:17:53 +0900
Subject: Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From: Fujii Masao
To: Heikki Linnakangas
Cc: Simon Riggs, Aidan Van Dyk, PostgreSQL-development

Sorry for the delay.

On Fri, Mar 19, 2010 at 8:37 PM, Heikki Linnakangas wrote:
> Here's a patch I've been playing with.

Thanks! I'm reading the patch.

> The idea is that in standby mode,
> the server keeps trying to make progress in the recovery by:
>
> a) restoring files from archive
> b) replaying files from pg_xlog
> c) streaming from master
>
> When recovery reaches an invalid WAL record, typically caused by a
> half-written WAL file, it closes the file and moves to the next source.
> If an error is found in a file restored from archive or in a portion
> just streamed from master, however, a PANIC is thrown, because it's not
> expected to have errors in the archive or in the master.

But in the current behavior (v8.4 and earlier), recovery ends normally
when an invalid record is found in an archived WAL file. Otherwise, the
server could never start normal processing if an archived file were
corrupted for some reason. So an invalid record should not be treated
as a PANIC if the server is not in standby mode or the trigger file has
been created. Thoughts?

When I tested the patch, the following PANIC was thrown during normal
archive recovery. It seems to derive from the above change. The
detailed error sequence:

1. In ReadRecord(), emode was set to PANIC after
   00000001000000000000000B was read.
2. An attempt was then made to read 00000001000000000000000C, which
   should contain the contrecord, using that emode (= PANIC). Since
   00000001000000000000000C did not exist, a PANIC was thrown.

-----------------
LOG: restored log file "00000001000000000000000B" from archive
cp: cannot stat `../data.arh/00000001000000000000000C': No such file or directory
PANIC: could not open file "pg_xlog/00000001000000000000000C" (log file 0, segment 12): No such file or directory
LOG: startup process (PID 17204) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
-----------------

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
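
For illustration only, the error-level rule suggested in this message could be sketched roughly as below. This is not code from the patch or from PostgreSQL: the names `Source` and `invalid_record_level` are hypothetical, and returning "LOG" for the non-fatal case is an assumption standing in for "recovery ends normally". The only point is that PANIC would be reserved for a server that is still in standby mode, with no trigger file, reading from the archive or the stream.

```python
from enum import Enum


class Source(Enum):
    """Hypothetical labels for the three WAL sources named in the thread."""
    ARCHIVE = "archive"
    PG_XLOG = "pg_xlog"
    STREAM = "stream"


def invalid_record_level(source, standby_mode, triggered):
    """Return the error level to use when an invalid WAL record is found.

    Assumption: "LOG" stands for any non-fatal level that lets recovery
    end normally (plain archive recovery) or move on to the next source.
    """
    # Still acting as a standby: standby mode is on and no trigger file exists.
    still_standby = standby_mode and not triggered

    # Errors in the archive or in streamed WAL are unexpected, but they are
    # treated as fatal only while the server is still acting as a standby.
    if still_standby and source in (Source.ARCHIVE, Source.STREAM):
        return "PANIC"

    # Plain archive recovery, a triggered standby, or a half-written
    # file in pg_xlog: do not PANIC.
    return "LOG"


if __name__ == "__main__":
    # Plain archive recovery, as in the failing test case above.
    print(invalid_record_level(Source.ARCHIVE, standby_mode=False, triggered=False))
```

With a rule of this shape, the archive recovery case reported above (not in standby mode, next segment missing from the archive) would fall into the non-fatal branch instead of raising a PANIC.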