Re: Trying out read streams in pgvector (an extension)

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Heikki Linnakangas <[email protected]>
To: Thomas Munro <[email protected]>
To: pgsql-hackers <[email protected]>
Subject: Re: Trying out read streams in pgvector (an extension)
Date: Tue, 11 Jun 2024 10:47:31 +0300
Message-ID: <[email protected]> (raw)
In-Reply-To: <CA+hUKGJ_7NKd46nx1wbyXWriuZSNzsTfm+rhEuvU6nxZi3-KVw@mail.gmail.com>
References: <CA+hUKGJ_7NKd46nx1wbyXWriuZSNzsTfm+rhEuvU6nxZi3-KVw@mail.gmail.com>

On 11/06/2024 07:53, Thomas Munro wrote:
> Someone involved in that project mentioned that it's probably not a
> great topic to research in practice, because real world users of HNSW
> use fully cached ie prewarmed indexes, because the performance is so
> bad otherwise.  (Though maybe that argument is a little circular...).

I think that's true in practice for *building* an HNSW index, but faster 
*searching* when the index is not in memory seems quite useful. And of 
course, faster is always better, even if it's only in a non-optimal 
scenario.

> So although this patch clearly speeds up cold HSNW searches to a
> degree controlled by effective_io_concurrency, I'll probably look for
> something else.  Suggestions for interesting index types to look at
> streamifying are very welcome!

GiST and GIN?

> Hmm.  If that's really true about HNSW though, then there may still be
> an opportunity to do automatic memory prefetching[1].  But then in the
> case of index building, "stream" is NULL in this patch anyway.  It
> surely must also be possible to find some good places to put
> profitable explicit pg_mem_prefetch() calls given the predictability
> and the need to get only ~60ns ahead for that usage.  I didn't look
> into that because I was trying to prove things about read_stream.c,
> not get involved in another project :-D
> 
> Here ends my science experiment report, which I'm dropping here just
> in case others see useful ideas here.  The main thing I learned about
> the read stream API is that it'd be nice to be able to reset the
> stream but preserve the distance (something that came up on the
> streaming sequential scan thread for a different reason), to deal with
> cases where look-ahead opportunities come in bursts but you want a
> longer lived stream than I used here.  That is the reason the patch
> creates and destroys temporary streams in a loop; doh.  It also
> provides an interesting case study for what speculative random
> look-ahead support might need to look like.

This reminds me of a prototype I wrote earlier, see 
https://github.com/pgvector/pgvector/pull/386, 1st commit. It 
reorganizes HnswSearchLayer() so that it in iteration, it first collects 
all the neighbors to visit, and then visits them, somewhat similar to 
your patch.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

view thread (2+ messages)

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Trying out read streams in pgvector (an extension)
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox