Message-ID: <4B74485A.3050804@enterprisedb.com>
Date: Thu, 11 Feb 2010 20:11:38 +0200
From: Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>
Organization: EnterpriseDB
User-Agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)
MIME-Version: 1.0
To: Aidan Van Dyk <aidan@highrise.ca>
CC: Simon Riggs <simon@2ndQuadrant.com>, Fujii Masao <masao.fujii@gmail.com>,
	PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Re: [COMMITTERS] pgsql: Make standby server	continuously
	retry  restoring the next WAL
References: <1265884657.7341.1192.camel@ebony>
	<4B73F678.8070109@enterprisedb.com>
	<1265891248.7341.1346.camel@ebony>
	<4B73FB99.4080403@enterprisedb.com>
	<1265893599.7341.1454.camel@ebony>
	<4B740613.5090004@enterprisedb.com>
	<20100211140118.GB14128@oak.highrise.ca>
	<4B74118C.30704@enterprisedb.com>
	<20100211144204.GC14128@oak.highrise.ca>
	<4B7438A9.8090902@enterprisedb.com>
	<20100211173154.GD14128@oak.highrise.ca>
In-Reply-To: <20100211173154.GD14128@oak.highrise.ca>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Aidan Van Dyk wrote:
> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 12:04]:
> 
>>> But it can be a problem - without the last WAL (or at least enough of
>>> it) the master switched and archived, you have no guarantee of having
>>> been consistent again (I'm thinking specifically of recovering from a
>>> fresh backup)
>> You have to wait for the last WAL file required by the backup to be
>> archived before starting recovery. Otherwise there's no guarantee anyway.
> 
> Right, but now define "wait for".

As in don't start postmaster until the last WAL file needed by backup
has been fully copied to the archive.

> (because you've accepted that you *can* have short WAL files in the
> archive)

Only momentarily, while the copy is in progress.

> I've always made my PITR such that "in the archive" (i.e. the first
> moment a recovery can see it) implies that it's bit-for-bit identical to
> the original (or at least as bit-for-bit I can assume by checking
> various hashes I can afford to).  I just assumed that was kind of common
> practice.

It's certainly good practice, agreed, but hasn't been absolutely required.
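For illustration, the kind of check described above could look like the minimal Python sketch below. It treats an archived segment as good only if its size and SHA-256 digest match the original. The function names are hypothetical, SHA-256 is just one reasonable choice of hash, and nothing here is part of PostgreSQL.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def archived_copy_matches(original, archived):
    """True only if `archived` matches `original` in size and SHA-256 digest.

    The cheap size comparison runs first; digests are computed only
    when the sizes agree.
    """
    if os.path.getsize(original) != os.path.getsize(archived):
        return False
    return sha256_of(original) == sha256_of(archived)
```

Note that a matching size alone proves little (see the Windows 'copy' case below); the digest comparison is what catches a preallocated but not yet filled file.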

> I'm amazed that "partial WAL" files are ever available in anyone's
> archive, for anyone's restore command to actually pull.  I find that
> scary, and sure, probably won't regularly be noticed... But man, I'd
> hate the time I need that emergency PITR restore to be the one time when
> it needs that WAL, pulls it slightly before the copy has finished (i.e.
> the master is pushing the WAL over a WAN to a 2nd site), and have my
> restore complete consistently...

It's not as dramatic as you make it sound. We're only talking about the
last WAL file, and only when it's just being copied to the archive. If
you have an archive_command like 'cp', and you look at the archive at the
same millisecond that 'cp' runs, then yes you will see that the latest
WAL file in the archive is only partially copied. It's not a problem for
robustness; if you had looked one millisecond earlier you would not have
seen the file there at all.

The Windows 'copy' command preallocates the whole file, which poses a
different problem: if you look at the file while it's being copied, the
file has the right length, but isn't in fact fully copied yet. I think
'rsync' has the same problem. To avoid that issue, you have to use
something like copy+rename to make it atomic. There isn't much we can do
in the server (or in pg_standby) to work around that, because there's no
way to distinguish a file that's being copied from a fully-copied
corrupt file.
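A minimal Python sketch of the copy+rename approach, assuming the archive is a local directory on a POSIX-style filesystem where rename() within one directory is atomic: the data is written to a temporary file in the archive directory, fsync'd, and only then renamed to the final segment name, so a reader never sees a partially written file under that name. The function name archive_atomically() and the ".partial-" temp-file prefix are hypothetical.

```python
import os
import shutil
import tempfile


def archive_atomically(src, archive_dir):
    """Copy `src` into `archive_dir` so the final name appears only when complete.

    Raises FileExistsError rather than overwriting an already-archived file.
    Returns the final path.
    """
    dest = os.path.join(archive_dir, os.path.basename(src))
    if os.path.exists(dest):
        raise FileExistsError(dest)
    # The temp file lives in the same directory as dest, so the rename
    # below stays on one filesystem and is atomic.
    fd, tmp = tempfile.mkstemp(prefix=".partial-", dir=archive_dir)
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())  # data is on disk before the name appears
        os.rename(tmp, dest)
    except BaseException:
        # Never leave stray temp files behind on failure.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return dest
```

The existence check and the rename are separate steps, so this assumes a single archiver writes into the directory; with concurrent writers, a POSIX rename could still replace a file that appeared in between.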

We do advise setting up an archive_command that doesn't overwrite
existing files. Combined with a partial WAL segment, that can cause a
problem: if archive_command crashes while it's writing the file, leaving
a partial file in the archive, the subsequent run of archive_command
won't overwrite it and will get stuck trying. However, there's a small
window for that even if you put the file into the archive atomically: if
you crash just after fully copying the file, but before the .done file
is created, upon restart the server will also try to copy the file to
archive, find that it already exists, and fail.
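One way to close that window, sketched below under the assumption that the archive is a local directory: if the target file already exists, compare it byte-for-byte with the source and report success when they are identical, failing only when they differ. Writing via a temporary file and rename also keeps a crash mid-copy from leaving a truncated file under the final name. The function name archive_wal() and the ".partial-" prefix are hypothetical; the return value mimics an archive_command exit status (0 for success).

```python
import filecmp
import os
import shutil
import tempfile


def archive_wal(src, archive_dir):
    """Archive one WAL segment without ever overwriting a different file.

    Returns 0 on success and 1 on failure, like an archive_command exit status.
    """
    dest = os.path.join(archive_dir, os.path.basename(src))
    if os.path.exists(dest):
        # A previous attempt may have finished copying just before a crash,
        # before the server recorded the segment as archived. An identical
        # file counts as success; a different one is a real failure.
        return 0 if filecmp.cmp(src, dest, shallow=False) else 1
    fd, tmp = tempfile.mkstemp(prefix=".partial-", dir=archive_dir)
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())
        os.rename(tmp, dest)  # the final name appears only once complete
    except OSError:
        if os.path.exists(tmp):
            os.unlink(tmp)
        return 1
    return 0
```

To use it as archive_command, a small wrapper script would call sys.exit(archive_wal(sys.argv[1], sys.argv[2])) and be configured with something like archive_command = 'python3 /path/to/archive_wal.py %p /mnt/archive' (script and archive paths hypothetical). A differing file in the archive still stops archiving, which is deliberate: it is not something the script can safely resolve on its own.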

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com