
Archiving and Point in Time Recovery
___________________________________________
Simon Riggs     simon@2ndquadrant.com


1 Physical Database Backup and Recovery
=======================================
PostgreSQL supports two modes of protection:
- crash recovery protection
- archive recovery protection

Crash Recovery protection is automatically enabled. Should the server crash, it will recover itself quickly to the point of failure and continue if possible. It does this by re-reading the last few entries in the transaction log files.

Archive recovery protects you against the following situations, which more simple crash recovery is unable to protect against:
- accidental or malicious destruction or alteration of data
- accidentally or maliciously deleted database files
- loss of disk drives so that data is unobtainable
- loss of server 
- site disasters

Archive recovery protection is not enabled by default. To benefit from this level of protection, some careful thought and planning will be required by the administrator. Following on from this, testing of the set-up is highly recommended. Most importantly, a system of procedures and checks will be needed to ensure that everything works should unfortunate events occur.

Archive recovery also allows you to perform a Point in Time Recovery (PITR), to allow you to perform only a partial recovery should you wish to undo the effects of accidental or malicious damage to your data.

1.1 Overview of Archive Recovery Protection
===========================================
The main rule is: when you set archive_command, take a backup.

To benefit from archive recovery protection you must configure your server to utilise an archive_command, specified in postgresql.conf. When this has been correctly configured we say we are operating in archive mode.

In archive mode, when a transaction log fills, PostgreSQL executes the archive_command to copy transaction log files to a secure backup location of your choosing, allowing you to protect them for when you need to recover. This clearly requires proper prior planning to ensure recovery is possible.

Should a crash occur in archive mode, then PostgreSQL will perform crash recovery normally. If a serious crash occurs, you will need to restore and invoke archive recovery.

Archive recovery starts from a backup and rolls forward using the transaction log records until recovery completes. Archive recovery of a PostgreSQL database cluster is only possible if you have both of these data sets available and intact:
- a full physical backup of the database
This is a copy of the files that make up the database, including all data files, tablespace directories etc. It need not include the pg_xlog directory. This does not refer to the use of the pg_dump utility.
- a full sequence of archived transaction log files
These two sets of data files must have been produced by the same original database, otherwise recovery will be very unlikely to start at all, and even if it does will fail again very quickly.

A full sequence of archived transaction log files forms a chain that links the backup to the latest point. If you break the chain i.e. the two sets of files don't match, or you've deleted a transaction log or lost a tape - then you will only be able to recover up to the point the chain broke. The longer the chain, the easier it will be to break, so backup regularly and work out a robust archiving policy. If you want to rescue a drowning man - don't use a rope that has been cut into two pieces...

Both crash recovery and archive recovery use the same logic to reapply data changes - the difference between them is that archive recovery uses restored data files and has a number of additional options. Transaction logs will be restored for use directly from archive in a stream that prevents disk space from overflowing. As a result, very long archive recoveries are possible, should you need to restore from a very old backup, or restore onto a server with much reduced disk space in comparison to the original server.

Recovery occurs much faster than the rate at which data was originally written to the transaction logs. You should test this on your own systems, though you should expect anywhere between x2 and x20 recovery rate, e.g. 12 hours of transaction logs could take less than an hour to recover. It is the total volume of the transaction logs that effects this time, not the actual time elapsed for writing the original transaction log files. 

In summary, it is practical in many circumstances to take a daily hot backup of the server, then rely upon archived transaction logs to peform archive recovery in the event of a server crash.

You are encouraged to use specifically designed software, open-source or closed-source, to automate and manage the backup and restore process, rather than to rely on your own hand-scripted versions. In the same way that you use PostgreSQL to manage the access to and writing of your data, utilise other specially designed software to manage the backup and restore (BAR) process.

Note that there will always be a small amount of data lost when archive recovery is used, since the current transaction log file that was current at the time of the crash will not yet have been copied to archive. Even higher availability with PostgreSQL is possible using one of the many specifically designed PostgreSQL replication server add-ons.

1.2 Simple Archive Recovery Situation
=======================================
If you would like to benefit from Archive Recovery protection.
1. Decide an archive destination where you will safely store your backups and archived transaction log files.
2. Edit postgresql.conf and set archive_command so that archived transaction logs are copied to the archive destination, using whatever method you think appropriate.
3. Use pg_ctl reload to update the postmaster's server parameters. An archiver process will start and begin to archive transaction log files when they become full.
4. Immediately backup the complete data directory, including all tablespaces, but excluding the pg_xlog sub-directory.
5. That's it, you're protected. You should now take regular full database backups and monitor the archiver to check that files are being correctly copied to archive and can be read back intact from the archive destination. You should begin to see archiver messages being written to the server log.

Should recovery be required:
1. Restore the latest full backup of the database
2. Copy the file share/recovery.conf.sample to the data directory of the cluster to be recovered. Rename the file to recovery.conf and then edit to specify a restore_command that will copy files back from the archive (which may be a different tape drive etc..).
3. Startup the PostgreSQL server and watch it recover

Life cannot be relied upon to be that simple, so various additional options and warnings are discussed here also.

1.3 Backup
==========

Archive recovery must use a full physical backup of the database as a base from which to rollforward. 

You have the option of taking
- a "cold" backup - when the server has been fully shutdown (smart or fast, NOT immediate)
- a "hot" backup - while the server is still running

A cold backup may be restored on its own, even if archived transaction log files are not available. No recovery time will be required to startup from this backup, though the data within it will be out of date.

Archive recovery may use either a hot or a cold backup from which to start rolling forwards in time. A hot backup is only useful when used in conjunction with archived transaction log files as part of the archive recovery process. If archived transaction log files are not available then hot backups are useless - they contain an inconsistent database state, even assuming you could work out a way to read them. So, if you have not set archive_command then hot backups should not be taken at all.

A cold backup taken after an immediate shutdown is treated as a hot backup for recovery purposes.

Whether you choose hot or cold backups, you must include ALL files within the data directory of the PostgreSQL database cluster you whish to backup, plus all tablespaces and all other associated files. The ONLY exception to this is that you may exclude the pg_xlog directory and all its descendants from your backup.

This is a copy of the files that make up the database, including all data files, tablespace directories etc. It need not include the pg_xlog directory. This does not refer to the use of the pg_dump utility.

The database backup chosen as the base for the archive recovery must have been taken after archive mode was enabled.

The backup method you choose is not important to PostgreSQL.

Your options are:
- a simple tar command, copying data files to tape or disk
- an scp command to copy data to another server
- using disk software or hardware that allows a frozen snapshot or split mirror copy of the complete database to be taken very quickly

1.4 Archiving Transaction Logs
==============================

When archive mode is enabled archiving is performed by a new process underneath the postmaster, called archiver. The archiver is started at the end of database recovery, so there is never a conflict between an extended recovery and the archiver.

When users make changes to the database, records are written to the transaction log files, which eventually become full and records are then written to the next log file. At the point when a transaction log fills, the archiver is immediately notified and begins copying the recently filled file away to the archive destination - the file is not written to again by PostgreSQL. The notification mechanism is robust, so will still function correctly as long as there are backends writing to the database, even if the postmaster and/or other backends have crashed, for whatever reason.

Note that the archiver only operates when in archive mode. When not in archive mode, not files will be archived. If archive mode is disabled and then afterwards re-enabled, you are warned that there is no attempt to archive any files that filled during the period where archive mode was disabled. This will break the chain of archived transaction log files and recovery will stop at that point. The main rule is: when you set archive_command, take a backup.

When operating in archive mode, you will still see log messages saying "recycled transaction log ....". The checkpoint process recycles transaction log files whether operating in archive mode or not. In archive mode, files will not be recycled until they have been successfully archived.

If archive fails, the archiver will retry the command, with up to 3 re-tries.

If archiving fails, then an ERROR message will be written to the log and the server will continue. In this case, it is possible that the disk space available to pg_xlog will fill up. Currently, should this occur, the server will eventually crash. The administrator should monitor the archival process to ensure its continued success, just as you would monitor the successful completion of backup jobs.

Errors might occur for a number of reasons depending upon the method selected for archiving. If archiving to tape, a manual tape change may be required. If archiving to a remote network, a dial-up may have dropped or the destination server may be restarting. Many error situations will be specific to your equipment and its configuration.

As a general rule, pg_xlog should be given space equivalent to 10 times the setting of checkpoint_files. This should give the administrator sufficient time to be notified and then rectify any faults or provide any manual interventions required. The actual space requirement will vary according to your application's usage profile.

1.5 Archive and Restore Commands
=================================

During both archive and restore, PostgreSQL calls a shell command that you provide.

For example, this command could be:
- a simple shell command such as cp, scp, or rsync
- a shell script or program with more complex behaviour
- a program call to Backup and Recovery (BAR) system to perform the desired action

As a result, PostgreSQL can be configured to work with any open or closed source BAR system that you currently use or plan to use in the future.

The requirements for that program are that it must
- return a zero return code if it succeeds and return a nonzero return code if it does not succeed
- be initiated from the PostgreSQL-owning operating system userid
- designed to manage one file at a time
- be responsible for managing any other uniqueness information required to guarantee that transaction log files from different servers can be uniquely identified

Transaction log files are archived and recovered one at a time. They should always be archived in the order they were written, as well as being sequentially numbered to ensure human readability of filenames. At restore time the files are requested by name, but are again requested in the order they were written. This mechanism is optimised for tape (or other sequential) backup systems. 

Some archival mechanisms require you to wait while the file is fetched. Such delays could have a major effect on overall recovery time. If any file read-ahead or buffering is required to increase performance, then this will need to be designed and written into the archive/recovery command.

Please check contrib for any sample scripts that may have been contributed to allow PostgreSQL to work in various configurations, or connected to various BAR systems.

Using thes features, it is possible to implement an Automated
Standby Database. This is an active-passive High Availability option. In
this mode, the main server sends archived log files to a second, standby
server. The standby server is set up to be in permanent recovery, by
using a restore command that waits for each file to be shipped to it. The
standby system receives each file, then recovers up to that point - so
the standby system is always a few seconds from completing its startup
should it be required.

1.6 Performing Archive Recovery
===============================

If you think you need to perform archive recovery then you should:
1. immediately backup any and all datafiles still available to make sure you have a known position to return to while you ATTEMPT recovery - you may make a mistake and need to retry. If you do, or think you have: Be cool - we all make mistakes, just don't let your errors multiply.
2. contact the owner of the data immediately. You may need them to make a decision on which way to proceed.

To avoid confusion, PostgreSQL notes a difference between thse two terms
- restore - the task of moving data onto a system from an archive destination
- recovery - using the restored data to reconstruct the database to a consistent, correct and more recent state than was available solely from the backup

Archive Recovery consists of these steps:
1. Restore the full database backup
2. Check whether pg_xlog files have been restored, if so decide what to do with these based upon what type of recovery is being performed
2. Set the recovery configuration parameters in the file recovery.conf
3. Start the database system

To enter Archive Recovery, recovery.conf must be present. When it is present, it will force the database system into recovery mode, whatever the restored control file says. Copy the file share/recovery.conf.sample to the data directory of the cluster to be recovered. Rename the file to recovery.conf and then edit to specify a restore_command that will copy files back from the archive (which may be a different tape drive etc..). restore_command is the only mandatory parameter within recovery.conf. You do not have to perform a Point in Time Recovery, that is optional.

recovery.conf specifies an additional set of temporary configuration paramters that will be used during the recovery phase. Once recovery is complete, these configuration parameters will no longer be used and the server will operate normally. recovery.conf is renamed recovery.done.

Should the server crash (or crash again) after archive recovery it will automatically enter crash recovery, ignoring the previous settings in the recovery.conf. The server will only ever enter archive recovery at startup when a recovery.conf file exists in the data directory for that server. Should the server crash during archive recovery, then you will need to restart the recovery process from the beginning again. PostgreSQL will not perform this automatically in many circumstances. If it can, it will restart from the point just after the full database restore.

Transaction log files are restored one at a time. The files are requested by name, though are requested in the sequential order they were written. Once used in the recovery process, restored files are discarded to save disk space and avoid any confusion that might occur.

When recovering from total system failure bare metal or disaster recovery, you will always be missing the file that was current when disaster struck. As a result, you should expect to lose some transactions that had been committed in the last few seconds prior to crash - this is known as the recovery lost time window. On average, this will be 50% of the time between log file writes. This might be anywhere between a few seconds or a few minutes, depending upon your application. Check the time differences between file write to see how long this is at both peak and off-peak for your application.

If the potential loss of data is too long, you may wish to consider reducing the size of xlog files, or moving onto a replication solution for higher availability.

If you are recovering from a total system failure and your full system backup has been restored WITH some existing transaction logs, these should be moved into another directory before you begin recovery. If you do not, then PostgreSQL will attempt to archive these as soon as recovery completes. If you recover to end of logs, these will have the same name as existing archived logs and problems will occur later should you try to re-recover. These should NOT be deleted or moved IF you are performing a Point in Time Recovery.

1.7 Recovery.conf Settings
==========================

recovery.conf should be placed in the data directory of the server which is to be recovered. The file is  read on postmaster startup; it is not possible to alter archive recovery parameters during recovery by issuing a reload to the PostgreSQL server. After successful recovery, it is renamed to recovery.done to ensure we do not accidentally re-enter archive recovery mode.

recovery.conf.sample is available in the share directory and can be edited to suit the needs of your recovery situation. You should carefully read the meaning of each parameter before using it.

restore command
---------------
Specifies the shell command that is executed to copy log files back from archival storage.

The command will be executed once for each transaction log file, so will likely be called many times during a recovery. No state information is passed, so if it is important to remember information between calls, then it is up to the admin to ensure the command does this, e.g. to avoid tape rewinds between calls.

The recovery process does not know when the chain of transaction logs ends, so will keep requesting files until there are no more in the archive (or it stops earlier when finding a recovery_target, see later). As a result, the program will very likely request files that are not present. File not found must return a nonzero exit status, as should all other error states. All nonzero exit status codes are interpreted as file not found, so it you should re-run a recovery if an unexpected error message is received.

Within the command string, you should specify two parameters: %f and %p. %f is the name of the desired log file, without any path or location information. %p is the absolute path of the file to copy the log file to. If you confuse these then the restore will not function correctly.

recovery_target parameters
--------------------------
The default behaviour is to recover all the way until the end of the transaction log is reached (or the end of whatever part of it is still available). 

If you wish to stop before the end of the transaction log is reached, you must specify a recovery_target. The recovery_target can be specified as either a transaction id (xid) or a time, but not both. Recovery can only halt when it reaches a commit or an abort record. 

If you specify a recovery_target_xid then recovery will stop when exactly that xid commits or aborts. Recovery will not stop until end of logs if the recovery_target_xid is not found, which could occur if the transaction was still in progress when the server crashed, or because you have specified too high a value. Recovery will not stop when it finds the commit record of a higher xid. This is because xids are assigned sequentially according to when transactions start; transactions don't end in the sequence they started.

If you specify recovery_target_inclusive then recovery will include the recovery target xid, otherwise it will exclude (and therefore effectively abort) the recovery target xid.

Transaction log timestamps are not necessarily unique. If you specify recovery_target_time and recovery_target_inclusive=true, then recovery will stop at the record immediately after the last commit/abort record that matches the recovery_target_time. If you specify recovery_target_time and recovery_target_inclusive=false then recovery will stop just before the first commit/abort that matches that time.

Some additional notes on recovery_target_time are:
- the time can be specified in any timezone, though you should be aware that the actual target is the timezone that the server was operating in when the transaction logs were written. You should take extra care when restoring near a Daylight Saving Time date boundary, or should you or any co-workers attempt to recover the database onto a server in a different timezone.
- should the system clock be set backwards in time there will be a discontinuity in time values in the log records. In that case, there would be some records that would not be valid candidates for a recovery target when using a time value. In those situations, you should specify a recovery_target_xid. 
All other functions will continue normally through these situations - only the recovery_target_time is effected.

recovery_target_timeline
------------------------
If you have recently recovered the database server and subsequently discovered that the recovery was, for whatever reason, unacceptable then PostgreSQL offers you further alternatives. This is the ONLY time that you should use this parameter to change the default behaviour of the recovery process.

See the section on multiple recovery before deciding how to set this parameter.

1.8 Multiple Recovery Situations
================================

PostgreSQL archive recovery has been designed to be fully and consistently recovered even in the face of complex disaster and recovery situations.

Real-world experience shows that if a database system crashes, it unfortunately may often crash again soon afterwards. Also, since sever crashes are rare and very stressful, we may need to re-evaluate earlier decisions or actions and so may need to repeat the recovery process again within hours or days.

PostgreSQL implements a feature known as timelines to cope with the complexities of multiple recoveries.

Timelines do not effect the normal running of the server in any way. A new timeline is only created when we perform a recovery that does not run all the way to the end of logs on the latest timeline. If you then later wish to recover the system again, you must decide which timeline you wish to recover to, since you now have two options: the original and the post-recovery version of history. This option may seem like an extra uneccessary decision, though in serious recovery situations you will be glad to have a few options to consider.

PostgreSQL allows considerable complexity in this regard, with timelines being automatically managed to help prevent various type of errors.

An example of a situation that you may encounter could be:
1. We are running happily, but spot a rogue transaction that we wish to
expunge. We decide to run a PITR up to that txnid, but not include it. We do an archive
recovery to a recovery_target_xid. We have available to us local copies
of the xlogs if required but choose not to utilise them.

2. After operating for an hour, we decide that the previous recovery
was an extremely bad idea and decide to recover back to a point
AFTER the original recovery_target - i.e. to try to pretend we
had never attempted PITR in the first place because there was some even
more important data just recently committed we didn't know about. We run a recovery using the original recovery_target_timeline value and a recovery_target_time.

3. Our recovery fails because the then-current log, which has not
been archived, was deleted because we wouldn't need it anymore. We
decide that we made the right choice in the first place and decide to
re-run the PITR, though to a point slightly ahead of where we stopped
last time we tried that. This works.

4. After operating successfully for three hours the rogue transaction happens again. We realise that the rogue transaction is in 
fact a deliberate security violation. We immediately close network
access and try to recover. Management decides we must accept the first
rogue transaction's effects, but the second is too large to be
acceptable. We need to recover to a recovery_target_xid prior to the
second attack. The first recovery meant that xids were being reused (on
a different timeline) and so the xid we wish to recover MAY exist on
both the first and second timeline. To ensure we don't recover the wrong
transactions, we decide to specify we wish to recover to a
recovery_target_xid on recovery_target_timeline = latest.

[really need a diagram here]

...hopefully stopping the panic at some point.

How does PostgreSQL handle this?

Up to now, we have discussed Crash Recovery and Archive Recovery. The latter had two modes: to end of logs and stop at a recovery target. Timelines provide another 2 modes to help you during complex multiple recovery situations.

So, overall there are be 5 recovery modes with 4 levels of default, summarised in the following table. This also shows what set of actions/parameters we need to specify to invoke that mode...

----------------SPECIFIED ITEMs
MODE            recovery.conf   recovery_target*     target_in_timeline
crash recovery
- up to 7.4-------------no------no--------------no
archive recovery 
- current end-----------yes-----no--------------no
- current target (*)----yes-----yes-------------no
  current timeline
- other target (*)------yes-----either----------yes, = 'latest'
  latest timeline
- other target (*)------yes-----either----------yes, = '<value>'
  other timeline

(*) these operations cause a new timeline to be created

More verbosely (not numbered because they're NOT a sequence or
progression)

- if no recovery.conf is present we do crash recovery to end of logs on
pg_control timeline (i.e. current) 

- if recovery.conf is present and we do not specify a target we do
archive recovery to end of logs on pg_control timeline (i.e. current) 

- if recovery.conf is present and we specify a target, but no timeline,
then we do archive recovery on the pg_control timeline only, and assume
that the target was supposed to be on this, even if we don't find it

- if recovery.conf is present and we specify a timeline of literally
'latest' (without having to know what that is) - then we search archive
for the latest history file, then we do archive recovery from the
pg_control timeline to the latest timeline as specified in the latest
history file. If we specify a target, then this is searched for on
whatever timeline we're on as we rollforward.

- if recovery.conf is present and we specify a timeline - then we search
archive for that history file, then we do archive recovery from the
pg_control timeline to the specified timeline as shown in that history
file. If we specify a target, then this is searched for on whatever
timeline we're on as we rollforward.
