Re: Basebackup fails without useful error message

* Re: Basebackup fails without useful error message
@ 2024-09-29 15:01  Adrian Klaver <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Adrian Klaver @ 2024-09-29 15:01 UTC (permalink / raw)
  To: Koen De Groote <[email protected]>; PostgreSQL General <[email protected]>

On 9/29/24 07:48, Koen De Groote wrote:
> Having run a basebackup, I'm getting this output at the very end:
> 
> pg_basebackup: child process exited with error 1
> pg_basebackup: removing contents of data directory "/mnt/base_backup/dir/"

What is the complete command you are using?

> 
> Is there a way to get more information as to what exactly happened?

Have you looked at the Postgres log?

Is --verbose being used?

> 
> I'd like to look into fixing this or doing whatever is required so that 
> it doesn't happen again, but this just isn't enough info. Where do I 
> start looking?
> 
> Regards,
> Koen De Groote

-- 
Adrian Klaver
[email protected]








* Re: Basebackup fails without useful error message
@ 2024-09-29 15:57  Koen De Groote <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Koen De Groote @ 2024-09-29 15:57 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: PostgreSQL General <[email protected]>

> What is the complete command you are using?

The full command is:

pg_basebackup -h localhost -p 5432 -U basebackup_user -D /mnt/base_backup/dir -Ft -z -P

So output Format as tar, gzipped, and with progress being printed.

> Have you looked at the Postgres log?

> Is --verbose being used?

This is straight from the logs, it's the only output besides the % progress
counter.

Will have a look at --verbose.
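
For reference, a minimal sketch of re-running the same command with
--verbose while keeping its diagnostic output in a file. The function name
and the log file argument are hypothetical; the connection options are the
ones from the command above. It assumes bash, and relies on pg_basebackup
writing its progress and messages to stderr.

```shell
#!/usr/bin/env bash
# Sketch only: run_verbose_backup and its log file argument are hypothetical names.
run_verbose_backup() {
    local target_dir="$1"   # backup destination, e.g. /mnt/base_backup/dir
    local log_file="$2"     # file that collects progress and error messages

    # Progress (-P) and verbose messages go to stderr, so append that stream
    # to the log. The function returns pg_basebackup's own exit status.
    pg_basebackup -h localhost -p 5432 -U basebackup_user \
        -D "$target_dir" -Ft -z -P --verbose 2>>"$log_file"
}
```

Calling it as `run_verbose_backup /mnt/base_backup/dir /tmp/basebackup.log`
(the log path is a placeholder) leaves the messages available after a
failure, even though the backup directory itself is cleaned up.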

Regards,
Koen De Groote


* Re: Basebackup fails without useful error message
@ 2024-09-29 16:08  Adrian Klaver <[email protected]>
  parent: Koen De Groote <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Adrian Klaver @ 2024-09-29 16:08 UTC (permalink / raw)
  To: Koen De Groote <[email protected]>; +Cc: PostgreSQL General <[email protected]>

On 9/29/24 08:57, Koen De Groote wrote:
>  > What is the complete command you are using?
> 
> The full command is:
> 
> pg_basebackup -h localhost -p 5432 -U basebackup_user -D 
> /mnt/base_backup/dir -Ft -z -P
> 
> So output Format as tar, gzipped, and with progress being printed.
> 
>  > Have you looked at the Postgres log?
> 
>  > Is --verbose being used?
> 
> This is straight from the logs, it's the only output besides the % 
> progress counter.
> 
> Will have a look at --verbose.

When you report back on that, and if it still does not show the error,
please also provide:

Postgres version.

OS and version.

Anything special about the cluster like tablespaces, extensions, 
replication, etc.


> 
> Regards,
> Koen De Groote
> 

-- 
Adrian Klaver
[email protected]








* Re: Basebackup fails without useful error message
@ 2024-10-20 21:03  Koen De Groote <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 2 replies; 7+ messages in thread

From: Koen De Groote @ 2024-10-20 21:03 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: PostgreSQL General <[email protected]>

Hello Adrian, and everyone else.

It has finally happened, the backup ran into an error again, and the
verbose output set me on the right path.

I'm getting this error message:

> pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

Combined with the main server logging:

> terminating walsender process due to replication timeout

Now, the server is set up with an archive_command which gzips the WAL files
and writes them to a network filesystem.

From looking at machine metrics at the time, my conclusion is the following:

At the time of the error, the remote filesystem experienced a very high
queue size for new writes.

So I'm assuming the process of writing WAL files, if there is an
archive_command set, is only considered to be finished after the archive is
written, not just when the WAL file is written in pg_wal.

I'm also seeing in the documentation that the default WAL method for
pg_basebackup is "stream", which waits for these WAL files as they are
produced.

I suspect that I have 2 possible paths at this point:

1: increase wal_sender_timeout
2: run the basebackup with --wal-method=none since my restore_command is
set up to explicitly go to the very same network storage to get the
archived WAL files.
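
The two paths above could be sketched roughly as follows. This is only an
illustration under assumptions: the function names, the "postgres"
superuser role, and the '10min' default are hypothetical and not from this
setup. wal_sender_timeout and --wal-method=none are the real setting and
option names. A wal_sender_timeout of 0 disables the timeout entirely.

```shell
#!/usr/bin/env bash
# Sketch only: function names, the superuser role and the default value are assumptions.

# Path 1: raise wal_sender_timeout on the server and reload the configuration.
raise_wal_sender_timeout() {
    local timeout="${1:-10min}"   # arbitrary example value; 0 would disable the timeout
    # Each -c runs as its own statement, which ALTER SYSTEM requires.
    psql -h localhost -p 5432 -U postgres -d postgres \
        -c "ALTER SYSTEM SET wal_sender_timeout = '${timeout}'" \
        -c "SELECT pg_reload_conf()"
}

# Path 2: take the base backup without streaming WAL alongside it.
backup_without_wal() {
    local target_dir="$1"
    pg_basebackup -h localhost -p 5432 -U basebackup_user \
        -D "$target_dir" -Ft -z -P --verbose --wal-method=none
}
```

With path 2 the backup is not self-contained: a restore only works if the
WAL archive holds every segment generated while the backup was running.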

I'm going to be testing this. If someone could confirm that this is how
writing WAL files works, that being: that it is only considered "done" when
the archive_command is done, that would be great.

Regards,
Koen De Groote


* Re: Basebackup fails without useful error message
@ 2024-10-20 21:12  Adrian Klaver <[email protected]>
  parent: Koen De Groote <[email protected]>
  1 sibling, 0 replies; 7+ messages in thread

From: Adrian Klaver @ 2024-10-20 21:12 UTC (permalink / raw)
  To: Koen De Groote <[email protected]>; +Cc: PostgreSQL General <[email protected]>

On 10/20/24 14:03, Koen De Groote wrote:

> So I'm assuming the process of writing WAL files, if there is an 
> archive_command set, is only considered to be finished after the archive 
> is written, not just when the WAL file is written in pg_wal.

https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-ARCHIVING-WAL

"It is important that the archive command return zero exit status if and 
only if it succeeds. Upon getting a zero result, PostgreSQL will assume 
that the file has been successfully archived, and will remove or recycle 
it. However, a nonzero status tells PostgreSQL that the file was not 
archived; it will try again periodically until it succeeds."
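
As an illustration of that exit-status contract, here is a minimal sketch
of a gzip-based archive helper along the lines of what was described. It is
not the actual script in use: the function name and the default directory
/mnt/wal_archive are hypothetical. It would be called from archive_command
with %p (path of the WAL file) and %f (its file name).

```shell
#!/usr/bin/env bash
# Sketch only: archive_wal and /mnt/wal_archive are hypothetical.
archive_wal() {
    local wal_path="$1"                          # %p: path of the WAL segment
    local wal_name="$2"                          # %f: file name of the segment
    local archive_dir="${3:-/mnt/wal_archive}"   # placeholder archive location
    local target="${archive_dir}/${wal_name}.gz"
    local tmp="${target}.tmp.$$"

    # Refuse to overwrite an existing archive file; nonzero means "not archived".
    if [ -e "$target" ]; then
        echo "archive_wal: ${target} already exists" >&2
        return 1
    fi

    # Compress to a temporary name first so a partial write is never mistaken
    # for a finished archive file.
    if ! gzip -c "$wal_path" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi

    # Zero is returned only if the final rename succeeds.
    mv "$tmp" "$target"
}
```

If the network filesystem stalls, the helper simply takes longer or fails,
and the server keeps the segment in pg_wal and retries later.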

> Regards,
> Koen De Groote
> 
> 

-- 
Adrian Klaver
[email protected]


* Re: Basebackup fails without useful error message
@ 2024-10-21 22:34  David G. Johnston <[email protected]>
  parent: Koen De Groote <[email protected]>
  1 sibling, 1 reply; 7+ messages in thread

From: David G. Johnston @ 2024-10-21 22:34 UTC (permalink / raw)
  To: Koen De Groote <[email protected]>; +Cc: Adrian Klaver <[email protected]>; PostgreSQL General <[email protected]>

On Sunday, October 20, 2024, Koen De Groote <[email protected]> wrote:
>
>
> I'm going to be testing this. If someone could confirm that this is how
> writing WAL files works, that being: that it is only considered "done" when
> the archive_command is done, that would be great.
>

The archiving of WAL files by the primary does not involve a replication
connection of any sort and thus the “WAL sender” settings are not relevant
to it; or, here, whether or not you are archiving your WAL is immaterial
since you are streaming it as it gets produced.

If you are streaming WAL it seems highly unusual that you’d end up in a
situation where the connection goes idle long enough that it gets killed,
especially if the backup is still happening.  I’d probably go with
performing the backup under a disabled (or extremely large?) timeout though
and move on to other things.

That isn’t to say I fully understand what actually is happening here…

David J.


* Re: Basebackup fails without useful error message
@ 2024-10-22 19:50  Koen De Groote <[email protected]>
  parent: David G. Johnston <[email protected]>
  0 siblings, 0 replies; 7+ messages in thread

From: Koen De Groote @ 2024-10-22 19:50 UTC (permalink / raw)
  To: David G. Johnston <[email protected]>; +Cc: Adrian Klaver <[email protected]>; PostgreSQL General <[email protected]>

Hello David,

I saw the backup fail. The logs showed the walsender being terminated, and
correlating the moment of failure with my storage metrics shows that the
storage was experiencing very high IOWAIT at that time. This was
network-mounted storage.

The backup process continued, but because the WAL could not be streamed
without error (due to a local issue), the entire backup was marked as failed.
At the end, pg_basebackup deletes the backup in this case, unless it is
run with --no-clean.

I'll be testing restore soon without streaming WAL, since the actual
restore I perform doesn't use the pg_wal.tar.gz file. It gets the archived
WAL. At least I think it doesn't need it, hence the need for testing.
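
For context, a minimal sketch of the kind of restore helper such a test
would exercise, assuming the archive holds gzipped segments named after the
WAL file. The function name and the default directory /mnt/wal_archive are
hypothetical. It would be called from restore_command with %f (file name)
and %p (destination path).

```shell
#!/usr/bin/env bash
# Sketch only: restore_wal and /mnt/wal_archive are hypothetical.
restore_wal() {
    local wal_name="$1"                          # %f: requested WAL file name
    local dest_path="$2"                         # %p: where the server wants it
    local archive_dir="${3:-/mnt/wal_archive}"   # placeholder archive location
    local source="${archive_dir}/${wal_name}.gz"

    # The server also asks for files that are not in the archive; a nonzero
    # exit status is the expected answer in that case.
    [ -f "$source" ] || return 1

    # Decompress the archived segment into the requested location.
    gzip -dc "$source" > "$dest_path"
}
```

A restore test from a backup taken without streamed WAL succeeds only if
this lookup finds every segment written during the backup.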

Regards,
Koen De Groote

