Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v9MTr-00GhL0-1k for pgsql-hackers@arkaria.postgresql.org; Thu, 16 Oct 2025 11:49:14 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1v9MTp-00CNKo-VT for pgsql-hackers@arkaria.postgresql.org; Thu, 16 Oct 2025 11:49:13 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v9MTp-00CNKe-DO for pgsql-hackers@lists.postgresql.org; Thu, 16 Oct 2025 11:49:12 +0000 Received: from mail-pl1-x630.google.com ([2607:f8b0:4864:20::630]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1v9MTm-0025Ub-0R for pgsql-hackers@lists.postgresql.org; Thu, 16 Oct 2025 11:49:11 +0000 Received: by mail-pl1-x630.google.com with SMTP id d9443c01a7336-27d4d6b7ab5so8052665ad.2 for ; Thu, 16 Oct 2025 04:49:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1760615349; x=1761220149; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=bnK+qm9CAZoOoMQy5IgwZ5ayyBhp8lrE55XvdP0HTUg=; b=X3ur6t4lQAGsIO1JL7/MFqjFnfvQzZgdbkscIjEILRo9U1rIhJMAY7joW6DM8qbCN7 c2VrX4CFVSl9Fn3n3gZJ3xl8IZyArPk+DoBbz0SY+QshuGCOJRGzrKCYiNaAcstknj3U LP4QbW+3vJ4md75W4eZVm8wVhZbVWxzM4HVGKafpYuSZ0QT32OfFbQKaRq5/mQ+nnPp9 f+f4msTAJVrJganFmOFVvXx7D7rEokzHyt0NAIpkHdq2GTAZuJ5naBr5rNxpaGMJzPVe rztHILBHe6vfquvWl1es2OO3QwkIs8C8PRXzjzMRub58JY0lJX57ts4afSX4xlSVugXo tjyA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1760615349; x=1761220149; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=bnK+qm9CAZoOoMQy5IgwZ5ayyBhp8lrE55XvdP0HTUg=; b=dIvp/azShrclth7fICqAEavSxLUD79RZuCF247RTr2WcbGowaF/XHvG6D+hFs255bg C4AJNSzhQKPjvzy07lH6K5sVuYk/vY4TqhmoC+nRzmB6NQwZSFOcCGY8OU9+4YlOJfq7 jt3QEkPV3tsULL8B21vv13rnHBsehHjFMTrUjZTgnZ075pG+aalKW1pPwx2uYbX1MdBs gCSn+7ulr1Ta2PWEm26U8FINe0nTVK+MCnV8P2pMt8eTos376G1cAWYkcZQiNDpOqKjR BK3VVwUeAgtHVBEZ0HeusFsSsq6Vk9t9j84/PCF8EPBqXfVaNRVyuCA0Ij0S4PFYZIi6 lDjA== X-Gm-Message-State: AOJu0Yzz43RpMfG2bL5o1zlss9ks9K5wWeWgkTDXzJEXNlCAI48tlrvw wkVrlOFgq7kJybBquP51WipGd37b230Ny8PGuVckT7pCCkc/EgifXSUR4/XTbj4bAIJPH7C42kK l2TLJ5aKH28JJFdI3POAdRSd+5j/3Oqk= X-Gm-Gg: ASbGncvMLQ9Xo7Nis6ZZQNmPR4f+Upk+Suq0pdUcYJ6eRthhuo4/QLjoN2AJUBjkUZR /IfWciZQTvOPXzlhPRT5pmJ3M3BoMYEYeGKmZx2pZjncMYFmDl0ymLwKOCwqasPWQog/nqsgqK0 jW63XKyt1ejkxa2TUDUTliRfssYELLR8W4jDoZEh7IXyGsQxbHLIUQ381hO1svAhnbgSOWrje6u w1Xgnzk2dQ6PsjDZOncqtkOKEBz8LTJ92jIG9WvmXS2vNpd+nRv+DYIfHOufMo4oK2v20qXyN8+ M2I3KHc= X-Google-Smtp-Source: AGHT+IH3dzIFFdi3Q+kzGfEDxiGn6NwSkYHG3/8ZWgBMxPj+1Tk8/lErldWjT5ipYsfTq4b6uQXEZYBe5ESjaX2u/4A= X-Received: by 2002:a17:903:1b0b:b0:25c:d4b6:f117 with SMTP id d9443c01a7336-290272c1ab0mr432872985ad.35.1760615348817; Thu, 16 Oct 2025 04:49:08 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Amul Sul Date: Thu, 16 Oct 2025 17:18:31 +0530 X-Gm-Features: AS18NWCKhOaoWjkBEzxWgP6hLWAEnATptdyTiv41BMbTTBCBoC_zMKMsLW2DLnM Message-ID: Subject: Re: pg_waldump: support decoding of WAL inside tarfile To: Robert Haas Cc: PostgreSQL Hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Fri, Oct 10, 2025 at 11:32=E2=80=AFPM Robert Haas wrote: > > On Mon, Sep 29, 2025 at 12:17=E2=80=AFPM Amul Sul wro= te: > > While reusing the buffered data > > from the first iteration is technically possible, that only works if > > the desired start LSN is at the absolute beginning of the archive, or > > later in the sequence, which cannot be reliably guaranteed. > > I spent a bunch of time studying this code today and I think that the > problem you're talking about here is evidence of a design problem with > astreamer_wal_read() and some of the other code in > astreamer_waldump.c. Your code calls astreamer_wal_read() when it > wants to peek at the first xlog block to determine the WAL segment > size, and it also calls astreamer_wal_read() when it wants read WAL > sequentially beginning at the start LSN and continuing until it > reaches the end LSN. However, these two cases have very different > requirements. verify_tar_archive(), which is misleadingly named and > really exists to determine the WAL segment size, just wants to read > the first xlog block that physically appears in the archive. Every > xlog block will have the same WAL segment size, so it does not matter > which one we read. On the other hand, TarWALDumpReadPage wants to read > WAL in sequential order. In other words, one call to > astreamer_wal_read() really wants to read a block without any block > reordering, and the other call wants to read a block with block > reordering. > > To me, it looks like the problem here is that the block reordering > functionality should live on top of the astreamer, not inside of it. > Imagine that astreamer just spits out the bytes in the order in which > they physically appear in the archive, and then there's another > component that consumes and reorders those bytes. So, you read data > and push it into the astreamer until the number of bytes in the output > buffer is at least XLOG_BLCKSZ, and then from there you extract the > WAL segment size. Then, you call XLogReaderAllocate() and enter the > main loop. The reordering logic lives inside of TarWALDumpReadPage(). > Each time it gets data from the astreamer's buffer, it either returns > it to the caller if it's in order or buffers it using temporary files > if not. > I initially considered implementing the reordering logic outside of astreamer when we first discussed this project, but the implementation could get complicated -- or at least feel hacky. Let me explain why: astreamer reads the archive in fixed-size chunks (here it is 128KB). Sometimes, a single read can contain data from two WAL files -- specifically, the tail end of one file and the start of the next -- because of how they=E2=80=99re physically stored in the archive. astreamer knows where one file ends and another begins through tags like ASTREAMER_MEMBER_HEADER, ASTREAMER_MEMBER_CONTENTS, and ASTREAMER_MEMBER_TRAILER. However, it can=E2=80=99t pause mid-chunk to hold data from the next file once the previous one ends and for the caller; it pushes the entire chunk it has read to the target buffer. So, if we put the reordering logic outside the streamer, we=E2=80=99d sometimes be receiving buffers containing mixed data from two WAL files. The caller would then need to correctly identify WAL file boundaries within those buffers. This would require passing extra metadata -- like segment numbers for the WAL files in the buffer, plus start and end offsets of those segments within the buffer. While not impossible, it feels a bit hacky and I'm unsure if that=E2=80=99s the best approach. > I found it's actually quite easy to write a patch that avoids > reopening the file. Here it is, on top of your v4: > > diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_wald= ump.c > index 2c42df46d43..c4346a5e211 100644 > --- a/src/bin/pg_waldump/pg_waldump.c > +++ b/src/bin/pg_waldump/pg_waldump.c > @@ -368,17 +368,8 @@ init_tar_archive_reader(XLogDumpPrivate *private, > const char *waldir, > XLogRecPtr startptr, XLogRecPtr endptr, > pg_compress_algorithm compression) > { > - int fd; > astreamer *streamer; > > - /* Open tar archive and store its file descriptor */ > - fd =3D open_file_in_directory(waldir, private->archive_name); > - > - if (fd < 0) > - pg_fatal("could not open file \"%s\"", private->archive_name); > - > - private->archive_fd =3D fd; > - > /* > * Create an appropriate chain of archive streamers for reading the = given > * tar archive. > @@ -1416,12 +1407,22 @@ main(int argc, char **argv) > /* we have everything we need, start reading */ > if (is_tar) > { > + /* Open tar archive and store its file descriptor */ > + private.archive_fd =3D > + open_file_in_directory(waldir, private.archive_name); > + if (private.archive_fd < 0) > + pg_fatal("could not open file \"%s\"", private.archive_name)= ; > + > /* Verify that the archive contains valid WAL files */ > waldir =3D waldir ? pg_strdup(waldir) : pg_strdup("."); > init_tar_archive_reader(&private, waldir, InvalidXLogRecPtr, > InvalidXLogRecPtr, compression); > verify_tar_archive(&private); > - free_tar_archive_reader(&private); > + astreamer_free(private.archive_streamer); > + > + if (lseek(private.archive_fd, 0, SEEK_SET) !=3D 0) > + pg_log_error("could not seek in file \"%s\": %m", > + private.archive_name); > > /* Set up for reading tar file */ > init_tar_archive_reader(&private, waldir, private.startptr, > > Of course, this is not really what we want to do: it avoids reopening > the file, but because we can't back up the archive streamer once it's > been created, we have to lseek back to the beginning of the file. But > notice how silly this looks: with this patch, we free the archive > reader and immediately create a new archive reader that is exactly the > same in every way except that we call astreamer_waldump_new(startptr, > endptr, private) instead of astreamer_waldump_new(InvalidXLogRecPtr, > InvalidXLogRecPtr, private). We could arrange to update the original > archive streamer with new values of startSegNo and endSegNo after > verify_tar_archive(), but that's still not quite good enough, because > we might have already made some decisions on what to do with the data > that we read that it's too late to reverse. But, what that means is > that the astreamer_waldump machinery is not smart enough to read one > block of data without making irreversible decisions from which we > can't recover without recreating the entire object. I think we can, > and should, try to do better. > Agreed. > It's also worth noting that the unfortunate layering doesn't just > require us to read the first block of the file: it also complicates > the code in various places. The fact that astreamer_wal_read() needs a > special case for XLogRecPtrIsInvalid(recptr) is a direct result of > this problem, and the READ_ANY_WAL() macro and both the places that > test it are also direct results of this problem. In other words, I'm > arguing that astreamer_wal_read() is incorrectly defined, and that > error creates ugliness in the code both above and below > astreamer_wal_read(). > > While I'm on the topic of astreamer_wal_read(), here are a few other > problems I noticed: > > * The return value is not documented, and it seems to always be count, > in which case it might as well return void. The caller already has the > value they passed for count. The caller will be xlogreader, and I believe we shouldn't change that. For the same reason, WALDumpReadPage() also returns the same. > * It seems like it would be more appropriate to assert that endPtr >=3D > len and just set startPtr =3D endPtr - len. I don't see how len > endPtr > can ever happen, and I bet bad things will happen if it does. > * "pg_waldump never ask the same" -> "pg_waldump never asks for the same" > Ok. > Also, this is absolutely not OK with me: > > /* Fetch more data */ > if (astreamer_archive_read(privateInfo) =3D=3D 0) > { > char fname[MAXFNAMELEN]; > XLogSegNo segno; > > XLByteToSeg(targetPagePtr, segno, WalSegS= z); an > XLogFileName(fname, > privateInfo->timeline, segno, WalSegSz); > > pg_fatal("could not find file \"%s\" > in \"%s\" archive", > fname, > privateInfo->archive_name); > } > > astreamer_archive_read() will return 0 if we reach the end of the > tarfile, so this is saying that if we reach the end of the tar file > without finding the range of bytes for which we're looking, the > explanation must be that the relevant WAL file is missing from the > archive. But that is way too much action at a distance. I was able to > easily construct a counterexample by copying the first 81920 bytes of > a valid WAL file and then doing this: > > [robert.haas pgsql-meson]$ tar tf pg_wal.tar > 000000010000000000000005 > [robert.haas pgsql-meson]$ pg_waldump -s 0/050008D8 -e 0/05FFED98 > pg_wal.tar >/dev/null > pg_waldump: error: could not find file "000000010000000000000005" in > "pg_wal.tar" archive > > Without the redirection to /dev/null, what happened was that > pg_waldump printed out a bunch of records from > 000000010000000000000005 and then said that 000000010000000000000005 > could not be found, which is obviously silly. But the fact that I > found a specific counterexample here isn't even really the point. The > point is that there's a big gap between what we actually know at this > point (which is that we've read the whole input file) and what the > message is claiming (which is that the reason must be that the file is > missing from the archive). Even if the counterexample above didn't > exist and that really were the only way for that to happen as of > today, that's very fragile. Maybe some future code change will make it > so that there's a second reason that could happen. How would somebody > realize that they had created a second condition by means of which > this code could be reached? If they did realize it, how would they get > the correct error to be reported? > Agreed, I'll think about this. > > /* Continue reading from the open WAL segment, if any */ > if (state->seg.ws_file >=3D 0) > { > /* > * To prevent a race condition where the archive streamer is stil= l > * exporting a file that we are trying to read, we invoke the str= eamer > * to ensure enough data is available. > */ > if (private->curSegNo =3D=3D state->seg.ws_segno) > astreamer_archive_read(private); > > return WALDumpReadPage(state, targetPagePtr, reqLen, targetPtr, > readBuff); > } > > But it's unclear why this should be good enough to ensure that enough > data is available. astreamer_archive_read() might read zero bytes and > return 0, so this doesn't really guarantee anything at all. On the > other hand, even if astereamer_archive_read() returns a non-zero > value, it's only going to read READ_CHUNK_SIZE bytes from the > underlying file, so if more than that needs to be read in order for us > to have enough data, we won't. I think it's very hard to imagine a > situation in which you can call astreamer_archive_read() without using > some loop. The loop isn't needed because the caller always requests 8KB of data, while READ_CHUNK_SIZE is 128KB. It=E2=80=99s assumed that the astreamer has already created the file with some initial data. For example, if only a few bytes have been written so far, when we reach TarWALDumpReadPage(), it detects that we=E2=80=99re reading the same file that the astreamer is still writing to and hasn=E2=80=99t finished. It then= request to appends 128KB of data by calling astreamer_archive_read, even though we only need 8KB at a time. This process repeats each time the next 8KBchunk i= s requested: astreamer_archive_read() appends another 128KB,and continues unt= il the file has been fully read and written. > That's what astreamer_wal_read() does: it calls > astreamer_archive_read() until it either returns 0 -- in which case we > know we've failed -- or until we have enough data. Here we just hope > that calling it once is enough, and that checking for errors is > unimportant. I also don't understand the reference to a race > condition, because there's only one process with one thread here, I > believe, so what would be racing against? > In the case where the astreamer is exporting a file to disk but hasn=E2=80= =99t finished writing it, and we call TarWALDumpReadPage() to request block(s) from that WAL file, we can read only up to the existing blocks in the file. Since the file is incomplete, reading may fail later. To handle this, astreamer_archive_read() is invoked to append more data -- usually more than the requested amount, as explained earlier. That is the race condition I am trying to handle. Now, regarding the concern of astreamer_archive_read() returning zero without reading or appending any data: this can happen only if the WAL is shorter than expected -- an incomplete. In that case, WALDumpReadPage() will raise the appropriate error, we don't have to check at that point, I think. > Another thing I noticed is that astreamer_archive_read() makes > reference to decrypting, but there's no cryptography involved in any > of this. > I think that was a typo -- I meant decompression. Regards, Amul