MIME-Version: 1.0
References: <CAGbX52Fg2jxqGVZzeQ_QcHSZ8fgDjnVZUJy5NUb5-PAf8fvxkw@mail.gmail.com>
 <52afa4c9-7393-4265-88bb-6393f1b0fb03@aklaver.com> <CAGbX52EyjjOQJUjA0m4+0azs_yH1by63p6hhJg7Y66=VCCqzpA@mail.gmail.com>
 <45e8c44b-2506-41b5-b999-5fdc42472644@aklaver.com>
In-Reply-To: <45e8c44b-2506-41b5-b999-5fdc42472644@aklaver.com>
From: Koen De Groote <kdg.dev@gmail.com>
Date: Sun, 20 Oct 2024 23:03:51 +0200
Message-ID: <CAGbX52ENsSHKoTyu5+XfN1o1bZ2w2CJaE1oQnxcm=fj2SyoZXg@mail.gmail.com>
Subject: Re: Basebackup fails without useful error message
To: Adrian Klaver <adrian.klaver@aklaver.com>
Cc: PostgreSQL General <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="00000000000039b5fc0624eee06e"
Archived-At: <https://www.postgresql.org/message-id/CAGbX52ENsSHKoTyu5%2BXfN1o1bZ2w2CJaE1oQnxcm%3Dfj2SyoZXg%40mail.gmail.com>
Precedence: bulk

--00000000000039b5fc0624eee06e
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hello Adrian, and everyone else.

It has finally happened: the backup ran into an error again, and the
verbose output set me on the right path.

I'm getting this error message:

> pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

Combined with the main server logging:

> terminating walsender process due to replication timeout

Now, the server is set up with an archive_command which gzips the WAL files
and writes them to a network filesystem.

From looking at machine metrics at the time, my conclusion is the following:

At the time of the error, the remote filesystem experienced a very high
queue size for new writes.

So I'm assuming that, when an archive_command is set, writing a WAL file is
only considered finished once the archive copy has been written, not merely
when the file has been written to pg_wal.

I'm also seeing in the documentation that the default WAL method for
pg_basebackup is "stream", which waits for these WAL files as they are
produced.

I suspect that I have 2 possible paths at this point:

1: increase wal_sender_timeout
2: run the basebackup with --wal-method=none since my restore_command is
set up to explicitly go to the very same network storage to get the
archived WAL files.
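
A rough sketch of what those two paths could look like, under some
assumptions: the '5min' value is only an example, the "postgres" superuser
for psql is assumed, and DRY_RUN plus the helper function names are made up
for illustration. By default it only prints the pg_basebackup command
instead of running it.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical guard for this sketch: by default the commands
# are only printed; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Path 1: raise the timeout on the server (the default is 60s).
# '5min' is an arbitrary example value; a superuser connection is assumed.
# ALTER SYSTEM cannot run inside a transaction block, so each statement
# gets its own psql call.
raise_timeout() {
    run psql -h localhost -p 5432 -U postgres -c "ALTER SYSTEM SET wal_sender_timeout = '5min'"
    run psql -h localhost -p 5432 -U postgres -c "SELECT pg_reload_conf()"
}

# Path 2: the same command as before, plus --verbose, with no WAL in the backup.
backup_without_wal() {
    run pg_basebackup -h localhost -p 5432 -U basebackup_user \
        -D /mnt/base_backup/dir -Ft -z -P --verbose --wal-method=none
}

backup_without_wal
```

With --wal-method=none the backup does not contain the WAL needed to make
it consistent, so a restore depends entirely on the archive being complete.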

I'm going to be testing this. If someone could confirm that this is how
writing WAL files works, namely that a file is only considered "done" once
the archive_command has finished, that would be great.

Regards,
Koen De Groote


On Sun, Sep 29, 2024 at 6:08 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

> On 9/29/24 08:57, Koen De Groote wrote:
> >  > What is the complete command you are using?
> >
> > The full command is:
> >
> > pg_basebackup -h localhost -p 5432 -U basebackup_user -D
> > /mnt/base_backup/dir -Ft -z -P
> >
> > So output Format as tar, gzipped, and with progress being printed.
> >
> >  > Have you looked at the Postgres log?
> >
> >  > Is --verbose being used?
> >
> > This is straight from the logs, it's the only output besides the %
> > progress counter.
> >
> > Will have a look at --verbose.
>
> When you report on that and if it does not report the error then what is?:
>
> Postgres version.
>
> OS and version.
>
> Anything special about the cluster like tablespaces, extensions,
> replication, etc.
>
>
> >
> > Regards,
> > Koen De Groote
> >
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
>
>

--00000000000039b5fc0624eee06e--