public inbox for [email protected]  
help / color / mirror / Atom feed
From: Kevin Grittner <[email protected]>
To: Josh Berkus <[email protected]>
To: [email protected] <[email protected]>
Subject: Re: Sample archive_command is still problematic
Date: Mon, 11 Aug 2014 11:36:28 -0700
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<WM!2e9e963ef1fa06e085aa9dbf135a205b718dad5470c1598eeabb71225f19676528ed8e4fad97f9075cbabdcbac8b68ac!@asav-1.01.com>
	<[email protected]>
List-Unsubscribe: <mailto:[email protected]?body=unsub%20pgsql-docs>

Josh Berkus <[email protected]> wrote:
> On 08/11/2014 10:21 AM, Kevin Grittner wrote:
>>> Is there some good reason why "test ! -f" was added to the
>>> sample?
>>
>> In an environment with more than one cluster archiving, it is
>> otherwise way too easy to copy a config file and have the WAL files
>> of the two systems overwriting one another.  I consider a check for
>> an already existing file on the target to be very good practice.
>> The errors in the log are a clue that something went wrong, and
>> gives you a chance to fix things without data loss.
>
> It depends on what you're guarding against.  In the case I was dealing
> with, the master crashed in the middle of an archive write.  As a
> result, the file existed, but was incomplete, and *needed* to be
> overwritten.  But because of 'test -f' archiving just kept failing.

I've seen that happen, too.  It's just that the script I used sent
an email to the DBAs when that happened, so the problem was quickly
investigated and resolved.  Also, our monitoring "big board" set an 
"LED" to red if we went an hour without a new WAL appearing in the 
archive directory.  IMV the archiving script should ensure there is 
no data loss, and you should have monitoring or alert systems in 
place to know when things stall.

>> The problem with the recommended command is that cp is not atomic.
>> The file can be read before the contents are materialized, causing
>> early end to recovery.  I have seen it happen.  The right way to do
>> this is to copy to a different name or directory and mv the file
>> into place once it is complete -- or use software which does that
>> automatically, like rsync does.
>
> Yeah, realistically, I think we need to start supplying a script or two
> in /contrib and referencing that.  I'm not sure how to make it work for
> the Windows users though.

That might work.  We should do something, though.  The example we
give in the docs is not production quality IMO, and is something of
an embarrassment.  The problem is, it may be hard to get agreement
on what that should look like.  As a DBA, I insisted on the check
for an existing file.  I also insisted on having scripts send an
email to the DBAs on the first occurrence of a failure (but not to
spam us on each and every failed attempt).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-docs mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-docs



view thread (25+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Sample archive_command is still problematic
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox