MIME-Version: 1.0
References: 
 <CAAJ_b94bqdWN3h2J-PzzzQ2Npbwct5ZQHggn_QoYGhC2rn-=WQ@mail.gmail.com>
 <CAAJ_b97PQjE4kFD8Qk6UvtLrfPMixw1nxBz0OP5Z2WB2B-uMxQ@mail.gmail.com>
 <CAAJ_b97JA8ehy_UDddrnGwDt9HG5NmJq8ATtmeMqo7YD-=tLyQ@mail.gmail.com>
 <CA+TgmoZjhWDG_AR1i+L1yss-wbuWvxrdRwSdVUUUnVPrJV2CnQ@mail.gmail.com>
 <CAAJ_b94Uh+b41LQG45bZFK+i62EVvv972LiGWWWuR64=-64rTQ@mail.gmail.com>
 <CA+TgmobF5c7ZcZHdEhqwNxGDZzWG2bDtpRaDtoVELWX_VHs_1A@mail.gmail.com>
 <CAAJ_b94gK1np8d1h-2c1YoCccGXr4zspTa-FC7X_bfXZNz=-DA@mail.gmail.com>
 <CA+TgmoayDY5b+bP1vRRN7A3xOP-=+tK13B2C1g-Xm1j4WTrT9Q@mail.gmail.com>
In-Reply-To: 
 <CA+TgmoayDY5b+bP1vRRN7A3xOP-=+tK13B2C1g-Xm1j4WTrT9Q@mail.gmail.com>
From: Amul Sul <sulamul@gmail.com>
Date: Thu, 16 Oct 2025 17:18:31 +0530
Message-ID: 
 <CAAJ_b97JAF+Zuoh2FBO79hVwLeaBPwsbXw-fY+313a7LfRQ-Bg@mail.gmail.com>
Subject: Re: pg_waldump: support decoding of WAL inside tarfile
To: Robert Haas <robertmhaas@gmail.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAAJ_b97JAF%2BZuoh2FBO79hVwLeaBPwsbXw-fY%2B313a7LfRQ-Bg%40mail.gmail.com>
Precedence: bulk

On Fri, Oct 10, 2025 at 11:32=E2=80=AFPM Robert Haas <robertmhaas@gmail.com=
> wrote:
>
> On Mon, Sep 29, 2025 at 12:17=E2=80=AFPM Amul Sul <sulamul@gmail.com> wro=
te:
> > While reusing the buffered data
> > from the first iteration is technically possible, that only works if
> > the desired start LSN is at the absolute beginning of the archive, or
> > later in the sequence, which cannot be reliably guaranteed.
>
> I spent a bunch of time studying this code today and I think that the
> problem you're talking about here is evidence of a design problem with
> astreamer_wal_read() and some of the other code in
> astreamer_waldump.c. Your code calls astreamer_wal_read() when it
> wants to peek at the first xlog block to determine the WAL segment
> size, and it also calls astreamer_wal_read() when it wants read WAL
> sequentially beginning at the start LSN and continuing until it
> reaches the end LSN. However, these two cases have very different
> requirements. verify_tar_archive(), which is misleadingly named and
> really exists to determine the WAL segment size, just wants to read
> the first xlog block that physically appears in the archive. Every
> xlog block will have the same WAL segment size, so it does not matter
> which one we read. On the other hand, TarWALDumpReadPage wants to read
> WAL in sequential order. In other words, one call to
> astreamer_wal_read() really wants to read a block without any block
> reordering, and the other call wants to read a block with block
> reordering.
>
> To me, it looks like the problem here is that the block reordering
> functionality should live on top of the astreamer, not inside of it.
> Imagine that astreamer just spits out the bytes in the order in which
> they physically appear in the archive, and then there's another
> component that consumes and reorders those bytes. So, you read data
> and push it into the astreamer until the number of bytes in the output
> buffer is at least XLOG_BLCKSZ, and then from there you extract the
> WAL segment size. Then, you call XLogReaderAllocate() and enter the
> main loop. The reordering logic lives inside of TarWALDumpReadPage().
> Each time it gets data from the astreamer's buffer, it either returns
> it to the caller if it's in order or buffers it using temporary files
> if not.
>

I initially considered implementing the reordering logic outside of
astreamer when we first discussed this project, but the implementation
could get complicated -- or at least feel hacky. Let me explain why:

astreamer reads the archive in fixed-size chunks (here it is 128KB).
Sometimes, a single read can contain data from two WAL files --
specifically, the tail end of one file and the start of the next --
because of how they=E2=80=99re physically stored in the archive. astreamer
knows where one file ends and another begins through tags like
ASTREAMER_MEMBER_HEADER, ASTREAMER_MEMBER_CONTENTS, and
ASTREAMER_MEMBER_TRAILER. However, it can=E2=80=99t pause mid-chunk to hold
data from the next file once the previous one ends and for the caller;
it pushes the entire chunk it has read to the target buffer.

So, if we put the reordering logic outside the streamer, we=E2=80=99d
sometimes be receiving buffers containing mixed data from two WAL
files. The caller would then need to correctly identify WAL file
boundaries within those buffers. This would require passing extra
metadata -- like segment numbers for the WAL files in the buffer, plus
start and end offsets of those segments within the buffer. While not
impossible, it feels a bit hacky and I'm unsure if that=E2=80=99s the best
approach.

> I found it's actually quite easy to write a patch that avoids
> reopening the file. Here it is, on top of your v4:
>
> diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_wald=
ump.c
> index 2c42df46d43..c4346a5e211 100644
> --- a/src/bin/pg_waldump/pg_waldump.c
> +++ b/src/bin/pg_waldump/pg_waldump.c
> @@ -368,17 +368,8 @@ init_tar_archive_reader(XLogDumpPrivate *private,
> const char *waldir,
>                          XLogRecPtr startptr, XLogRecPtr endptr,
>                          pg_compress_algorithm compression)
>  {
> -    int         fd;
>      astreamer  *streamer;
>
> -    /* Open tar archive and store its file descriptor */
> -    fd =3D open_file_in_directory(waldir, private->archive_name);
> -
> -    if (fd < 0)
> -        pg_fatal("could not open file \"%s\"", private->archive_name);
> -
> -    private->archive_fd =3D fd;
> -
>      /*
>       * Create an appropriate chain of archive streamers for reading the =
given
>       * tar archive.
> @@ -1416,12 +1407,22 @@ main(int argc, char **argv)
>      /* we have everything we need, start reading */
>      if (is_tar)
>      {
> +        /* Open tar archive and store its file descriptor */
> +        private.archive_fd =3D
> +            open_file_in_directory(waldir, private.archive_name);
> +        if (private.archive_fd < 0)
> +            pg_fatal("could not open file \"%s\"", private.archive_name)=
;
> +
>          /* Verify that the archive contains valid WAL files */
>          waldir =3D waldir ? pg_strdup(waldir) : pg_strdup(".");
>          init_tar_archive_reader(&private, waldir, InvalidXLogRecPtr,
>                                  InvalidXLogRecPtr, compression);
>          verify_tar_archive(&private);
> -        free_tar_archive_reader(&private);
> +        astreamer_free(private.archive_streamer);
> +
> +        if (lseek(private.archive_fd, 0, SEEK_SET) !=3D 0)
> +            pg_log_error("could not seek in file \"%s\": %m",
> +                         private.archive_name);
>
>          /* Set up for reading tar file */
>          init_tar_archive_reader(&private, waldir, private.startptr,
>
> Of course, this is not really what we want to do: it avoids reopening
> the file, but because we can't back up the archive streamer once it's
> been created, we have to lseek back to the beginning of the file. But
> notice how silly this looks: with this patch, we free the archive
> reader and immediately create a new archive reader that is exactly the
> same in every way except that we call astreamer_waldump_new(startptr,
> endptr, private) instead of astreamer_waldump_new(InvalidXLogRecPtr,
> InvalidXLogRecPtr, private). We could arrange to update the original
> archive streamer with new values of startSegNo and endSegNo after
> verify_tar_archive(), but that's still not quite good enough, because
> we might have already made some decisions on what to do with the data
> that we read that it's too late to reverse. But, what that means is
> that the astreamer_waldump machinery is not smart enough to read one
> block of data without making irreversible decisions from which we
> can't recover without recreating the entire object. I think we can,
> and should, try to do better.
>

Agreed.

> It's also worth noting that the unfortunate layering doesn't just
> require us to read the first block of the file: it also complicates
> the code in various places. The fact that astreamer_wal_read() needs a
> special case for XLogRecPtrIsInvalid(recptr) is a direct result of
> this problem, and the READ_ANY_WAL() macro and both the places that
> test it are also direct results of this problem. In other words, I'm
> arguing that astreamer_wal_read() is incorrectly defined, and that
> error creates ugliness in the code both above and below
> astreamer_wal_read().
>
> While I'm on the topic of astreamer_wal_read(), here are a few other
> problems I noticed:
>
> * The return value is not documented, and it seems to always be count,
> in which case it might as well return void. The caller already has the
> value they passed for count.

The caller will be xlogreader, and I believe we shouldn't change that.
For the same reason, WALDumpReadPage() also returns the same.

> * It seems like it would be more appropriate to assert that endPtr >=3D
> len and just set startPtr =3D endPtr - len. I don't see how len > endPtr
> can ever happen, and I bet bad things will happen if it does.
> * "pg_waldump never ask the same" -> "pg_waldump never asks for the same"
>

Ok.

> Also, this is absolutely not OK with me:
>
>                         /* Fetch more data */
>                         if (astreamer_archive_read(privateInfo) =3D=3D 0)
>                         {
>                                 char            fname[MAXFNAMELEN];
>                                 XLogSegNo       segno;
>
>                                 XLByteToSeg(targetPagePtr, segno, WalSegS=
z); an
>                                 XLogFileName(fname,
> privateInfo->timeline, segno, WalSegSz);
>
>                                 pg_fatal("could not find file \"%s\"
> in \"%s\" archive",
>                                                  fname,
> privateInfo->archive_name);
>                         }
>
> astreamer_archive_read() will return 0 if we reach the end of the
> tarfile, so this is saying that if we reach the end of the tar file
> without finding the range of bytes for which we're looking, the
> explanation must be that the relevant WAL file is missing from the
> archive. But that is way too much action at a distance. I was able to
> easily construct a counterexample by copying the first 81920 bytes of
> a valid WAL file and then doing this:
>
> [robert.haas pgsql-meson]$ tar tf pg_wal.tar
> 000000010000000000000005
> [robert.haas pgsql-meson]$ pg_waldump -s 0/050008D8 -e 0/05FFED98
> pg_wal.tar >/dev/null
> pg_waldump: error: could not find file "000000010000000000000005" in
> "pg_wal.tar" archive
>
> Without the redirection to /dev/null, what happened was that
> pg_waldump printed out a bunch of records from
> 000000010000000000000005 and then said that 000000010000000000000005
> could not be found, which is obviously silly. But the fact that I
> found a specific counterexample here isn't even really the point. The
> point is that there's a big gap between what we actually know at this
> point (which is that we've read the whole input file) and what the
> message is claiming (which is that the reason must be that the file is
> missing from the archive). Even if the counterexample above didn't
> exist and that really were the only way for that to happen as of
> today, that's very fragile. Maybe some future code change will make it
> so that there's a second reason that could happen. How would somebody
> realize that they had created a second condition by means of which
> this code could be reached? If they did realize it, how would they get
> the correct error to be reported?
>

Agreed, I'll think about this.

>
>     /* Continue reading from the open WAL segment, if any */
>     if (state->seg.ws_file >=3D 0)
>     {
>         /*
>          * To prevent a race condition where the archive streamer is stil=
l
>          * exporting a file that we are trying to read, we invoke the str=
eamer
>          * to ensure enough data is available.
>          */
>         if (private->curSegNo =3D=3D state->seg.ws_segno)
>             astreamer_archive_read(private);
>
>         return WALDumpReadPage(state, targetPagePtr, reqLen, targetPtr,
>                                readBuff);
>     }
>
> But it's unclear why this should be good enough to ensure that enough
> data is available. astreamer_archive_read() might read zero bytes and
> return 0, so this doesn't really guarantee anything at all. On the
> other hand, even if astereamer_archive_read() returns a non-zero
> value, it's only going to read READ_CHUNK_SIZE bytes from the
> underlying file, so if more than that needs to be read in order for us
> to have enough data, we won't. I think it's very hard to imagine a
> situation in which you can call astreamer_archive_read() without using
> some loop.

The loop isn't needed because the caller always requests 8KB of data,
while READ_CHUNK_SIZE is 128KB. It=E2=80=99s assumed that the astreamer has
already created the file with some initial data. For example, if only
a few bytes have been written so far, when we reach
TarWALDumpReadPage(), it detects that we=E2=80=99re reading the same file
that the astreamer is still writing to and hasn=E2=80=99t finished. It then=
 request to
appends 128KB of data by calling astreamer_archive_read, even though we
only need 8KB at a time. This process repeats each time the next 8KBchunk i=
s
requested: astreamer_archive_read() appends another 128KB,and continues unt=
il
the file has been fully read and written.

> That's what astreamer_wal_read() does: it calls
> astreamer_archive_read() until it either returns 0 -- in which case we
> know we've failed -- or until we have enough data. Here we just hope
> that calling it once is enough, and that checking for errors is
> unimportant. I also don't understand the reference to a race
> condition, because there's only one process with one thread here, I
> believe, so what would be racing against?
>

In the case where the astreamer is exporting a file to disk but hasn=E2=80=
=99t
finished writing it, and we call TarWALDumpReadPage() to request
block(s) from that WAL file, we can read only up to the existing
blocks in the file. Since the file is incomplete, reading may fail
later. To handle this, astreamer_archive_read() is invoked to append
more data -- usually more than the requested amount, as explained
earlier. That is the race condition I am trying to handle.

Now, regarding the concern of astreamer_archive_read() returning zero
without reading or appending any data: this can happen only if the WAL
is shorter than expected -- an incomplete. In that case,
WALDumpReadPage() will raise the appropriate error, we don't have to
check at that point, I think.

> Another thing I noticed is that astreamer_archive_read() makes
> reference to decrypting, but there's no cryptography involved in any
> of this.
>

I think that was a typo -- I meant decompression.

Regards,
Amul