public inbox for [email protected]
Re: Basebackup fails without useful error message (7+ messages, 3 participants)
* Re: Basebackup fails without useful error message
From: Adrian Klaver @ 2024-09-29 15:01 UTC
To: Koen De Groote <[email protected]>; PostgreSQL General <[email protected]>

On 9/29/24 07:48, Koen De Groote wrote:
> Having run a basebackup, I'm getting this output at the very end:
>
> pg_basebackup: child process exited with error 1
> pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"

What is the complete command you are using?

> Is there a way to get more information as to what exactly happened?

Have you looked at the Postgres log?

Is --verbose being used?

> I'd like to look into fixing this or doing whatever is required so that
> it doesn't happen again, but this just isn't enough info. Where do I
> start looking?
>
> Regards,
> Koen De Groote

--
Adrian Klaver
[email protected]
* Re: Basebackup fails without useful error message
From: Koen De Groote @ 2024-09-29 15:57 UTC
To: Adrian Klaver <[email protected]>; +Cc: PostgreSQL General <[email protected]>

> What is the complete command you are using?

The full command is:

pg_basebackup -h localhost -p 5432 -U basebackup_user -D /mnt/base_backup/dir -Ft -z -P

So output Format as tar, gzipped, and with progress being printed.

> Have you looked at the Postgres log?
> Is --verbose being used?

This is straight from the logs, it's the only output besides the % progress counter.

Will have a look at --verbose.

Regards,
Koen De Groote
* Re: Basebackup fails without useful error message
From: Adrian Klaver @ 2024-09-29 16:08 UTC
To: Koen De Groote <[email protected]>; +Cc: PostgreSQL General <[email protected]>

On 9/29/24 08:57, Koen De Groote wrote:
> > What is the complete command you are using?
>
> The full command is:
>
> pg_basebackup -h localhost -p 5432 -U basebackup_user -D
> /mnt/base_backup/dir -Ft -z -P
>
> So output Format as tar, gzipped, and with progress being printed.
>
> > Have you looked at the Postgres log?
> > Is --verbose being used?
>
> This is straight from the logs, it's the only output besides the %
> progress counter.
>
> Will have a look at --verbose.

When you report back on that, and if it does not show the error, then what are the following?

Postgres version.

OS and version.

Anything special about the cluster like tablespaces, extensions, replication, etc.

> Regards,
> Koen De Groote

--
Adrian Klaver
[email protected]
* Re: Basebackup fails without useful error message
From: Koen De Groote @ 2024-10-20 21:03 UTC
To: Adrian Klaver <[email protected]>; +Cc: PostgreSQL General <[email protected]>

Hello Adrian, and everyone else.

It has finally happened, the backup ran into an error again, and the verbose output set me on the right path.

I'm getting this error message:

> pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

Combined with the main server logging:

> terminating walsender process due to replication timeout

Now, the server is set up with an archive_command which gzips the WAL files and writes them to a network filesystem.

From looking at machine metrics at the time, my conclusion is the following: At the time of the error, the remote filesystem experienced a very high queue size for new writes.

So I'm assuming the process of writing WAL files, if there is an archive_command set, is only considered to be finished after the archive is written, not just when the WAL file is written in pg_wal.

I'm also seeing in the documentation that the default WAL method for pg_basebackup is "stream", which waits for these WAL files as they are produced.

I suspect that I have 2 possible paths at this point:

1: increase wal_sender_timeout
2: run the basebackup with --wal-method=none, since my restore_command is set up to explicitly go to the very same network storage to get the archived WAL files.

I'm going to be testing this. If someone could confirm that this is how writing WAL files works, that being: that it is only considered "done" when the archive_command is done, that would be great.

Regards,
Koen De Groote
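The two options listed above could be tried roughly as follows. This is only a sketch: the host, port, user, and target directory are copied from the command quoted earlier in the thread, the 10min value is an arbitrary placeholder, and passing wal_sender_timeout through PGOPTIONS assumes a server version where that setting can be changed per connection (PostgreSQL 12 or later, as far as I know). The script only prints the candidate commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch only: prints candidate pg_basebackup invocations instead of running them.
# Connection values are the ones quoted earlier in this thread.
BASE_ARGS="-h localhost -p 5432 -U basebackup_user -D /mnt/base_backup/dir -Ft -z -P --verbose"

# Option 1: keep streaming WAL, but ask for a longer walsender timeout.
# ASSUMPTION: the server accepts wal_sender_timeout per connection; the
# 10min value is a placeholder, not a recommendation.
option_longer_timeout() {
    printf '%s\n' "PGOPTIONS='-c wal_sender_timeout=10min' pg_basebackup $BASE_ARGS"
}

# Option 2: do not include WAL in the backup at all; the restore then depends
# entirely on restore_command being able to fetch the archived WAL.
option_no_wal() {
    printf '%s\n' "pg_basebackup $BASE_ARGS --wal-method=none"
}

option_longer_timeout
option_no_wal
```

With the second option the backup is not usable on its own, so a test restore from the WAL archive would be needed before relying on it.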
* Re: Basebackup fails without useful error message
From: Adrian Klaver @ 2024-10-20 21:12 UTC
To: Koen De Groote <[email protected]>; +Cc: PostgreSQL General <[email protected]>

On 10/20/24 14:03, Koen De Groote wrote:
> So I'm assuming the process of writing WAL files, if there is an
> archive_command set, is only considered to be finished after the archive
> is written, not just when the WAL file is written in pg_wal.

https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-ARCHIVING-WAL

"It is important that the archive command return zero exit status if and only if it succeeds. Upon getting a zero result, PostgreSQL will assume that the file has been successfully archived, and will remove or recycle it. However, a nonzero status tells PostgreSQL that the file was not archived; it will try again periodically until it succeeds."

> Regards,
> Koen De Groote

--
Adrian Klaver
[email protected]
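To illustrate the exit-status rule quoted above, here is a minimal sketch of a gzip-based archive helper. It is not the script used in this thread: the function name archive_wal, the example paths in the comment, and the choice to write to a temporary file first are all assumptions made for illustration. PostgreSQL would pass %p (path of the WAL file) and %f (its file name) through archive_command.

```shell
#!/bin/sh
# Hypothetical helper; a wrapper script could be wired up as, for example,
#   archive_command = '/path/to/archive_wal.sh %p /mnt/archive/%f.gz'
# (both paths are placeholders) and would end with: archive_wal "$1" "$2"
# Returns 0 only if the compressed copy is fully in place.
archive_wal() {
    src=$1    # WAL file to archive (%p)
    dest=$2   # final compressed file in the archive

    # Never overwrite an existing archive file; report failure instead.
    if [ -e "$dest" ]; then
        echo "archive_wal: $dest already exists" >&2
        return 1
    fi

    # Compress to a temporary name, then rename, so a half-written file
    # never appears under the final name.
    tmp="$dest.tmp.$$"
    if gzip -c "$src" > "$tmp" && mv "$tmp" "$dest"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```

If the network filesystem is slow, a helper like this simply takes longer to return; the server keeps the segment in pg_wal until it gets a zero exit status.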
* Re: Basebackup fails without useful error message
From: David G. Johnston @ 2024-10-21 22:34 UTC
To: Koen De Groote <[email protected]>; +Cc: Adrian Klaver <[email protected]>; PostgreSQL General <[email protected]>

On Sunday, October 20, 2024, Koen De Groote <[email protected]> wrote:
> I'm going to be testing this. If someone could confirm that this is how
> writing WAL files works, that being: that it is only considered "done" when
> the archive_command is done, that would be great.

The archiving of WAL files by the primary does not involve a replication connection of any sort and thus the “WAL sender” settings are not relevant to it; or, here, whether or not you are archiving your WAL is immaterial since you are streaming it as it gets produced.

If you are streaming WAL it seems highly unusual that you’d end up in a situation where the connection goes idle long enough that it gets killed, especially if the backup is still happening.

I’d probably go with performing the backup under a disabled (or extremely large?) timeout though and move on to other things.

That isn’t to say I fully understand what actually is happening here…

David J.
* Re: Basebackup fails without useful error message
From: Koen De Groote @ 2024-10-22 19:50 UTC
To: David G. Johnston <[email protected]>; +Cc: Adrian Klaver <[email protected]>; PostgreSQL General <[email protected]>

Hello David,

I saw the backup fail. The backup logged that it terminated the walsender, and correlating the moment it failed with the metrics of my storage shows that the storage was facing a huge IOWAIT at that time. And this was network-mounted storage.

The backup process continued, but because of a failure to stream WAL without error (due to a local issue), the entire backup was marked as failed. At the end, pg_basebackup will delete the backup in this case. There's no flag to control this final behavior.

I'll be testing restore soon without streaming WAL, since the actual restore I perform doesn't use the pg_wal.tar.gz file. It gets the archived WAL.

At least I think it doesn't need it, hence the need for testing.

Regards,
Koen De Groote
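For the restore test described above, recovery would have to obtain every WAL segment from the archive through restore_command. A minimal sketch of a decompressing helper follows, assuming the archive holds files named <segment>.gz (the thread does not say how the archived files are named) and using a hypothetical function name, restore_wal.

```shell
#!/bin/sh
# Hypothetical helper for restore_command; %f is the requested file name,
# %p the path PostgreSQL wants it copied to.
restore_wal() {
    archive_dir=$1   # directory holding the gzipped WAL files (assumed layout)
    fname=$2         # %f
    target=$3        # %p

    # A non-zero status tells the server the file is not in the archive.
    [ -f "$archive_dir/$fname.gz" ] || return 1

    # On failure, remove the partial target file and report an error.
    if gzip -dc "$archive_dir/$fname.gz" > "$target"; then
        return 0
    fi
    rm -f "$target"
    return 1
}
```

The server will also ask for files that are not in the archive, so returning non-zero for a missing file is expected behavior rather than an error condition.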