Re: Trying out read streams in pgvector (an extension)

From: Peter Geoghegan <[email protected]>
To: Thomas Munro <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Jonathan S. Katz <[email protected]>
Cc: pgsql-hackers <[email protected]>
Subject: Re: Trying out read streams in pgvector (an extension)
Date: Tue, 9 Dec 2025 17:38:10 -0500
Message-ID: <CAH2-Wz=b4fLaR0Ljjcnp3gvMqtRifD7ArM8KZd4JMgjVv9mtdQ@mail.gmail.com>
In-Reply-To: <CA+hUKGJLT2JvWLEiBXMbkSSc5so_Y7=N+S2ce7npjLw8QL3d5w@mail.gmail.com>
References: <CA+hUKGJ_7NKd46nx1wbyXWriuZSNzsTfm+rhEuvU6nxZi3-KVw@mail.gmail.com>
	<[email protected]>
	<CA+hUKG+x2BcqWzBC77cN0ewhzMF0kYhC6c4G_T2gJLPbqYQ6Ow@mail.gmail.com>
	<CA+hUKGL-3mBtkA9RTbLFHuSS5cviuv0ko7nBhCg9KM7Q-GSEkw@mail.gmail.com>
	<CAAKRu_ZVxzwRRbxedgb_LtkFaGf78XAbTO9uExvadV2DzaE=Jg@mail.gmail.com>
	<CA+hUKG+zLmkD9zus=JOjjC+j5p9R1+CSXNZgd5=exZ01ZTaKoA@mail.gmail.com>
	<CA+hUKGJx6FNqzsxfSOGH0nJZJq1MBc+t7NBKtAmy6zj4HD86tA@mail.gmail.com>
	<CAN55FZ16TEhgYbK=qSEbkO8utz+u232NksCEmJMC1G4iZvnbvA@mail.gmail.com>
	<CA+hUKGL7-Dx8KiUo=G91Y5tfFpwDUFFQJ6=9D8Gr1n=DZxGh+w@mail.gmail.com>
	<CAAKRu_ZGhnWZXOyEyZ2r47g-F7U8asMRA6U8YZw3h=2rR=m_hQ@mail.gmail.com>
	<CAN55FZ0tgjF1beJSRXw3rgkbzwPZ7ngChJkPZm9aJkPuaF=dmg@mail.gmail.com>
	<CAAKRu_Zwj83zCJhahhMO578-+JdfTbqMV_ktxr-XjiE8BHLo9g@mail.gmail.com>
	<CA+hUKGJLT2JvWLEiBXMbkSSc5so_Y7=N+S2ce7npjLw8QL3d5w@mail.gmail.com>

On Mon, Dec 8, 2025 at 10:47 PM Thomas Munro <[email protected]> wrote:
> Yielding just because you've scanned N index pages/tuples/whatever is
> harder to think about.  The stream shouldn't get far ahead unless it's
> recently been useful for I/O concurrency (though optimal distance
> heuristics are an open problem), but in this case a single invocation
> of the block number callback can call ReadBuffer() an arbitrary number
> of times, filtering out all the index tuples as it rampages through
> the whole index IIUC.  I see why you might want to yield periodically
> if you can, but I also wonder how much that can really help if you
> still have to pick up where you left off next time.

I think of it as a necessary precaution against pathological behavior
where the amount of memory used to cache matching tuples/TIDs gets out
of hand. There's no specific reason to expect that to happen (or no
good reason). But I'm pretty sure that it'll prove necessary to pay
non-zero attention to how much work has been done since the last time
we returned a tuple (when there's a tuple available to return).
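
As a rough illustration of that kind of precaution, here is a toy sketch in
plain C. It is not taken from any patch or from PostgreSQL itself: ToyScan,
toy_scan_fill, and both limits are hypothetical names and values chosen only
for the example. The scan yields either when the cache of matching TIDs hits
an assumed memory budget, or when an assumed amount of work has been done
since the last return and there is at least one tuple available to return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limits; real values would need modelling. */
#define MAX_PAGES_SINCE_RETURN 4	/* assumed work budget per call */
#define MAX_CACHED_TIDS 8			/* assumed cap on cached matches */

/* Toy stand-in for scan state; not a PostgreSQL struct. */
typedef struct ToyScan
{
	const int  *matches_per_page;	/* matches found on each index page */
	size_t		npages;				/* total number of index pages */
	size_t		next_page;			/* where to pick up next time */
	size_t		cached;				/* matching TIDs cached, not yet returned */
} ToyScan;

/*
 * Read index pages until one of the yield conditions fires or the scan
 * ends.  Returns the number of cached TIDs handed back to the caller
 * (0 only when the scan is exhausted and nothing is cached).
 */
static size_t
toy_scan_fill(ToyScan *scan)
{
	size_t		pages_since_return = 0;
	size_t		nreturned;

	assert(scan != NULL);

	while (scan->next_page < scan->npages)
	{
		scan->cached += (size_t) scan->matches_per_page[scan->next_page++];
		pages_since_return++;

		/* Memory precaution: never let the cache grow without bound. */
		if (scan->cached >= MAX_CACHED_TIDS)
			break;

		/* Work precaution: only yield if there is something to return. */
		if (scan->cached > 0 && pages_since_return >= MAX_PAGES_SINCE_RETURN)
			break;
	}

	nreturned = scan->cached;
	scan->cached = 0;
	return nreturned;
}
```

Note that in this sketch a long run of pages with no matches is still read
in a single call, since yielding with nothing to return would not help the
caller; next_page records where to resume.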

> I guess it
> depends on the distribution of matches.

To be clear, I haven't done any kind of modelling of the problems in
this area. Once I do that (in 2026), I'll be able to say more about
the requirements. Maybe Tomas could take a look sooner?

Right now my focus is on getting the basic interfaces/API revisions in
better shape. And avoiding regressions while doing so.

-- 
Peter Geoghegan
