Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1t2d61-005heU-Vy for pgsql-general@arkaria.postgresql.org; Sun, 20 Oct 2024 21:04:18 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1t2d5w-001VHh-Vq for pgsql-general@arkaria.postgresql.org; Sun, 20 Oct 2024 21:04:13 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1t2d5w-001VHZ-Fz for pgsql-general@lists.postgresql.org; Sun, 20 Oct 2024 21:04:12 +0000 Received: from mail-ej1-x629.google.com ([2a00:1450:4864:20::629]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1t2d5q-0026vl-IS for pgsql-general@lists.postgresql.org; Sun, 20 Oct 2024 21:04:12 +0000 Received: by mail-ej1-x629.google.com with SMTP id a640c23a62f3a-a9a4031f69fso549571266b.0 for ; Sun, 20 Oct 2024 14:04:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1729458243; x=1730063043; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=val3hBBGkuJT3BG2IGYoMVw/D4KGJOSvzJ10Fj89S6w=; b=PePCxPVxIQrNFZN2ebp/HBUuNfB58wZjU7roPNmj48pnUCPqwEngvGQH07tV7SSkmH bU9wh9QANefZ29hOT464Ze89JCuVi+I/QDhLMYL5PQPYD1wFdgZ6rlfdydN6/Pg0s93X ZnjgunoK4Cqv2M8K0modT4KdJqQbTkhvfMKdr+bQSUsIezPhBltKRTQjFrHo1zURiK01 2W2SZp+dr92ZQ0rY9EbV/YNYT95Lz9iYpbW+QRGM2x2MkgkSXXPjiIc1EA0qyVM0T0OV i9TLKZQFCYaBIdwFcVQxOOHaP68qqMBMs+wR09UKeSf+OSWROaEOZ78SiqGNKzpgwr9C qNhw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1729458243; x=1730063043; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=val3hBBGkuJT3BG2IGYoMVw/D4KGJOSvzJ10Fj89S6w=; b=O5a5YuxalQdlI5qFGaBRf2w/4MYAeBZW2yySX7Vjsnt+9jTVpund6T9S93akUl08k6 Jyi5/bxDLtEy++UwvBzpu/NQnu2I9zN3uUu54lsCwMEqNI3ZvWDmUrucOvP5KIvbE6A0 Mfv+mnFG+hPNTART1NBD5CJVYK3Yl0ZNkD1L22ei46+cHmEN+Ilej/eTrFA8fl/vZddZ ZOTmAn+zDqU13kHax4gdshyQLc/vDH9qYlCaMkD3XMvO+EPD/QBege6oPMjBuBl0LXa8 /eZrlZdkDN3cNhBAdeYuCStsMxppNxmfclC1zxcXoSM3K26DcB9TxwPKVgM7imreZK8Y NDYQ== X-Gm-Message-State: AOJu0YwJQo4hA252Ksg/uZ+KN9R9oIq/ZXQUHHmAqSRnnRW9fAeXJ+VE AEjAbWIXCw356dxdbqc+IetSuyB9rpHa/G0fBLkDElJk+RqeJiu4Yq0NpsZZbavbMnaqcdHdjFd EUHGuH2uXf5XueVAz8lIRXVA6fCwdE3Ar X-Google-Smtp-Source: AGHT+IGnyevZxZZTxffRsl4xfJsudILSIB1puHclT1+Leu6RtUdgLU2n+q0pkPspt1C0UugkdAG/h21Xhg5Nu3AyyCg= X-Received: by 2002:a17:907:97cb:b0:a9a:33c:f6e4 with SMTP id a640c23a62f3a-a9a69c9eb72mr957319766b.40.1729458242956; Sun, 20 Oct 2024 14:04:02 -0700 (PDT) MIME-Version: 1.0 References: <52afa4c9-7393-4265-88bb-6393f1b0fb03@aklaver.com> <45e8c44b-2506-41b5-b999-5fdc42472644@aklaver.com> In-Reply-To: <45e8c44b-2506-41b5-b999-5fdc42472644@aklaver.com> From: Koen De Groote Date: Sun, 20 Oct 2024 23:03:51 +0200 Message-ID: Subject: Re: Basebackup fails without useful error message To: Adrian Klaver Cc: PostgreSQL General Content-Type: multipart/alternative; boundary="00000000000039b5fc0624eee06e" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000039b5fc0624eee06e Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hello Adrian, and everyone else. It has finally happened, the backup ran into an error again, and the verbose output set me on the right path. I'm getting this error message: > pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. Combined with the main server logging: > terminating walsender process due to replication timeout Now, the server is set up with an archive_command which gzips the WAL files and writes them to a network filesystem. From looking at machine metrics at the time, my conclusion is the following= : At the time of the error, the remote filesystem experienced a very high queue size for new writes. So I'm assuming the process of writing WAL files, if there is an archive_command set, is only considered to be finished after the archive is written, not just when the WAL file is written in pg_wal. I'm also seeing in the documentation that the default WAL method for pg_basebackup is "stream", which waits for these WAL files as they are produced. I suspect that I have 2 possible paths at this point: 1: increase wal_sender_timeout 2: run the basebackup with --wal-method=3Dnone since my restore_command is set up to explicitly go to the very same network storage to get the archived WAL files. I'm going to be testing this. If someone could confirm that this is how writing WAL files works, that being: that it is only considered "done" when the archive_command is done, that would be great. Regards, Koen De Groote On Sun, Sep 29, 2024 at 6:08=E2=80=AFPM Adrian Klaver wrote: > On 9/29/24 08:57, Koen De Groote wrote: > > > What is the complete command you are using? > > > > The full command is: > > > > pg_basebackup -h localhost -p 5432 -U basebackup_user -D > > /mnt/base_backup/dir -Ft -z -P > > > > So output Format as tar, gzipped, and with progress being printed. > > > > > Have you looked at the Postgres log? > > > > > Is --verbose being used? > > > > This is straight from the logs, it's the only output besides the % > > progress counter. > > > > Will have a look at --verbose. > > When you report on that and if it does not report the error then what is?= : > > Postgres version. > > OS and version. > > Anything special about the cluster like tablespaces, extensions, > replication, etc. > > > > > > Regards, > > Koen De Groote > > > > -- > Adrian Klaver > adrian.klaver@aklaver.com > > --00000000000039b5fc0624eee06e Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello Adrian, and everyone else.

It has= finally happened, the backup ran into an error again, and the verbose outp= ut set me on the right path.

I'm getting this = error message:

> pg_basebackup: could not receive data from WAL= stream: server closed the connection unexpectedly
> This probably me= ans the server terminated abnormally
> before or while processing the= request.

Combined with the main server logging:

> terminating walsender process due to replication tim= eout

Now, the server is set up with an archive_comma= nd which gzips the WAL files and writes them to a network filesystem.
=

From looking at machine metrics at the time, my conclus= ion is the following:

At the time of the error, th= e remote filesystem experienced a very high queue size for new writes.

So I'm assuming the process of writing WAL files, = if there is an archive_command set, is only considered to be finished after= the archive is written, not just when the WAL file is written in pg_wal.

I'm also seeing in the documentation that the d= efault WAL method for pg_basebackup is "stream", which waits for = these WAL files as they are produced.

I suspect th= at I have 2 possible paths at this point:

1: incre= ase=C2=A0wal_sender_timeout
2: run the basebackup with=C2=A0--wal= -method=3Dnone since my restore_command is set up to explicitly go to the v= ery same network storage to get the archived WAL files.

I'm going to be testing this. If someone could confirm that this = is how writing WAL files works, that being: that it is only considered &quo= t;done" when the archive_command is done, that would be great.

Regards,
Koen De Groote


On Sun, Sep 29, 2024 at 6:08=E2=80=AFPM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
=
On 9/29/24 08:57, K= oen De Groote wrote:
>=C2=A0 > What is the complete command you are using?
>
> The full command is:
>
> pg_basebackup -h localhost -p 5432 -U basebackup_user -D
> /mnt/base_backup/dir -Ft -z -P
>
> So output Format as tar, gzipped, and with progress being printed.
>
>=C2=A0 > Have you looked at the Postgres log?
>
>=C2=A0 > Is --verbose being used?
>
> This is straight from the logs, it's the only output besides the %=
> progress counter.
>
> Will have a look at --verbose.

When you report on that and if it does not report the error then what is?:<= br>
Postgres version.

OS and version.

Anything special about the cluster like tablespaces, extensions,
replication, etc.


>
> Regards,
> Koen De Groote
>

--
Adrian Klaver
adrian.klave= r@aklaver.com

--00000000000039b5fc0624eee06e--