Actual RC of "restore_command" is relevant for DB startup

From: Gunnar "Nick" Bluth <[email protected]>
To: [email protected]
Cc: Gunnar Nick Bluth <[email protected]>
Subject: Actual RC of "restore_command" is relevant for DB startup
Date: Wed, 20 Apr 2016 14:55:34 +0200
Message-ID: <[email protected]>

Hello,

I've just stumbled across a certain oddity with "restore_command" while
setting up a fresh environment with segmented (i.e., firewalled) networks.

I configured the restore_command as found in the PGBARMan docs (using
ssh) and was a bit stunned that after a restart, I saw this in the logs:

2016-04-20 13:22:45 CEST [3788]: [2-1] db=,user= FATAL:  could not restore file "00000002.history" from archive: child process exited with exit code 255
2016-04-20 13:22:45 CEST [3786]: [3-1] db=,user= LOG:  startup process (PID 3788) exited with exit code 1
2016-04-20 13:22:45 CEST [3786]: [4-1] db=,user= LOG:  aborting startup due to startup process failure

Which was obviously caused by
ssh: connect to host <archive server> port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.0]

Now, the firewall does not let ssh through (yet), so the root cause is
quite obvious.

However, the docs[1] only state that:
"(...) if the command was terminated by a signal (other than SIGTERM,
which is used as part of a database server shutdown) or an error by the
shell (such as command not found), then recovery will abort and the
server will not start up."

In [2], Kevin Grittner stated that it might be that the command's RC
should be <= 255, otherwise it will be assessed as "failed badly; give up".

And indeed, after amending the restore_command with a "|| exit 1", the
server starts up just fine, using replication to fetch the missing WALs.
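
The effect of that "|| exit 1" can be sketched as a small shell helper.
This is only an illustration under stated assumptions: clamp_status and
fake_fetch are hypothetical names, and fake_fetch merely stands in for an
ssh/rsync call that fails with exit code 255 as in the log above.

```shell
#!/bin/sh
# Hypothetical helper: run the given command and collapse every failure
# to exit status 1, which is what appending "|| exit 1" achieves.
clamp_status() {
    "$@" || return 1
    return 0
}

# Stand-in (assumption) for a fetch command that cannot reach the
# archive server and therefore exits with 255.
fake_fetch() {
    return 255
}

# Capture the status in an "if" so the script also works under "set -e".
if clamp_status fake_fetch; then
    clamped=0
else
    clamped=$?
fi

echo "status seen by the caller: $clamped"
```

A successful command still passes through with status 0, so only the
failure case is changed.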

That is fine for me as a workaround for now. However, had I hit this not
while setting everything up from scratch, but during a disaster (or
simply during a downtime or very high load on the archive server while
restarting a slave), this (basically undocumented!) behavior would have
caused me quite a headache!

I reckon only few users will expect a connection timeout to fall into
the category of "command not found"...
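
To make the difference concrete, here is a minimal sketch of the exit
statuses involved. It does not show where PostgreSQL draws the line; it
only shows that a shell reports "command not found" as 127, while ssh
reports its own errors (such as a connection failure) as 255. The helper
status_of and the command name no_such_command_for_demo_xyz are
hypothetical, and "exit 255" is a stand-in for a failing ssh call.

```shell
#!/bin/sh
# Hypothetical helper: print the exit status of a command instead of
# failing, so the script also works under "set -e".
status_of() {
    if "$@"; then
        echo 0
    else
        echo $?
    fi
}

# A command name assumed not to exist; the shell reports 127.
not_found=$(status_of sh -c 'no_such_command_for_demo_xyz' 2>/dev/null)

# Stand-in for ssh failing to connect; ssh exits with 255 on error.
conn_fail=$(status_of sh -c 'exit 255')

# An ordinary failure reported by the fetch tool itself.
plain_fail=$(status_of sh -c 'exit 1')

echo "command not found:           $not_found"
echo "connection failure stand-in: $conn_fail"
echo "ordinary failure:            $plain_fail"
```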

Maybe the part "error by the shell (such as command not found)" could be
changed to "error by the shell (RC > 254, e.g. command not found or ssh
connection failure)" (actually, whatever the real behaviour is, I didn't
check the sources...)?

[1] http://www.postgresql.org/docs/current/static/archive-recovery-settings.html
[2] http://stackoverflow.com/questions/10524458/postgresql-9-1-streaming-replication-restore-command-spe...

Best regards,
-- 
Gunnar "Nick" Bluth
DBA ELSTER

Tel:   +49 911/991-4665
Mobil: +49 172/8853339
