public inbox for [email protected]
help / color / mirror / Atom feedFrom: Thomas Munro <[email protected]>
To: Nazir Bilal Yavuz <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: Jonathan S. Katz <[email protected]>
Cc: pgsql-hackers <[email protected]>
Subject: Re: Trying out read streams in pgvector (an extension)
Date: Wed, 12 Nov 2025 23:47:16 +1300
Message-ID: <CA+hUKGL7-Dx8KiUo=G91Y5tfFpwDUFFQJ6=9D8Gr1n=DZxGh+w@mail.gmail.com> (raw)
In-Reply-To: <CAN55FZ16TEhgYbK=qSEbkO8utz+u232NksCEmJMC1G4iZvnbvA@mail.gmail.com>
References: <CA+hUKGJ_7NKd46nx1wbyXWriuZSNzsTfm+rhEuvU6nxZi3-KVw@mail.gmail.com>
<[email protected]>
<CA+hUKG+x2BcqWzBC77cN0ewhzMF0kYhC6c4G_T2gJLPbqYQ6Ow@mail.gmail.com>
<CA+hUKGL-3mBtkA9RTbLFHuSS5cviuv0ko7nBhCg9KM7Q-GSEkw@mail.gmail.com>
<CAAKRu_ZVxzwRRbxedgb_LtkFaGf78XAbTO9uExvadV2DzaE=Jg@mail.gmail.com>
<CA+hUKG+zLmkD9zus=JOjjC+j5p9R1+CSXNZgd5=exZ01ZTaKoA@mail.gmail.com>
<CA+hUKGJx6FNqzsxfSOGH0nJZJq1MBc+t7NBKtAmy6zj4HD86tA@mail.gmail.com>
<CAN55FZ16TEhgYbK=qSEbkO8utz+u232NksCEmJMC1G4iZvnbvA@mail.gmail.com>
On Wed, Nov 12, 2025 at 9:04 PM Nazir Bilal Yavuz <[email protected]> wrote:
> On Wed, 12 Nov 2025 at 07:12, Thomas Munro <[email protected]> wrote:
> 0002:
>
> + /* End-of-stream. */
> + buf = read_stream_next_buffer(stream, NULL);
> + Assert(buf == InvalidBuffer);
> + buf = read_stream_next_buffer(stream, NULL);
> + Assert(buf == InvalidBuffer);
>
> I noticed there are two 'read_stream_next_buffer()' and
> 'InvalidBuffer' checks. Does having both provide any additional
> validation? I tried removing one of them, and the test still passed.
I wanted to demonstrate that this is a state that the stream is stuck
in until you call _resume().
I suppose an alternative design would be that _next_buffer() returns
InvalidBuffer only once (= the block number callback returns
InvalidBlock once) and then automatically resumes (= it restores the
distance) and then you can call read_stream_next_buffer() again (= the
block number callback will be called again to fill the stream up with
new buffers before waiting for the first one to be ready to give to
you if it isn't already). That would have the advantage of not
requiring a new function at all and make the patch even shorter, but I
don't know, I guess I thought that would be a bit more fragile in some
way, less explicit. Hmm, would it actually be better if it worked
like that?
> Also, there is one thing I wanted to clarify about the
> 'read_stream_resume()'. If 'read_stream_next_buffer()' returns an
> 'InvalidBuffer', then we can use 'read_stream_resume()' alone because
> we know that we already consumed all buffers in the stream. For the
> rest, we need to use 'read_stream_resume()' together with the
> 'read_stream_reset()', right?
For the rest, there would be no need to call read_stream_resume().
To recap the uses of read_stream_reset(): the original purpose was to
release any buffers (pins) that the stream is holding internally
because of look-ahead, and put it back to its original state, ready to
be reused. It is called (1) by read_stream_end() as an implementation
detail (eg if a LIMIT or anything else except ERROR/FATAL ends your
query early, we need to unpin buffers queued in the stream before we
pfree it), (2) explicitly by rescans, (3) in hypothetical code I
thought about that would want to stream blocks speculatively and then
change its mind after predicting incorrectly (I had a few patches like
that, abandoned for now), and then (4) in this case, by places that
temporarily ran out of block numbers, but will have some more again
soon and want to resume the stream.
It was already debatable whether heuristic state like lookahead
distance should survive acoss rescans, or in other words, whether the
expected I/O requirements of the previous scan are a useful prediction
of the requirements of the next scan, but the answer is clearer in
case (4), hence the desire to find a way to separate that use case
from the others.
view thread (18+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected]
Subject: Re: Trying out read streams in pgvector (an extension)
In-Reply-To: <CA+hUKGL7-Dx8KiUo=G91Y5tfFpwDUFFQJ6=9D8Gr1n=DZxGh+w@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox