Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rSZQH-00EYKw-Ky for pgsql-hackers@arkaria.postgresql.org; Wed, 24 Jan 2024 09:19:54 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rSZQG-003cJm-Mz for pgsql-hackers@arkaria.postgresql.org; Wed, 24 Jan 2024 09:19:52 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rSZQG-003cJd-CF for pgsql-hackers@lists.postgresql.org; Wed, 24 Jan 2024 09:19:52 +0000 Received: from mail-ed1-x532.google.com ([2a00:1450:4864:20::532]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rSZQB-003Raq-5t for pgsql-hackers@lists.postgresql.org; Wed, 24 Jan 2024 09:19:51 +0000 Received: by mail-ed1-x532.google.com with SMTP id 4fb4d7f45d1cf-55a9008c185so6548758a12.1 for ; Wed, 24 Jan 2024 01:19:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1706087985; x=1706692785; darn=lists.postgresql.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=okamYYJ90zGlQTbnX+bXzskQGwqZG3Khwt1ODbOKY9A=; b=jqb/MBR5wsR0esHIjqfcAt1D9D0vA/FoqXaNPnf/K7yI4eVXnOaWL7o/pi9nLenze2 gJNSVqsPv57gRzLOEO2gmc7HVMbN+yVHbBrinQiMO8h6b5DqHL8f5feB8lUrgbBtwxPm YLRyU7/8D48w29vXG1cinvFh3ATk02m37dz/W8ZRn71KOkmIoWRGmpQ5/PmG2y+qkvi4 ud/syZLP0ZSFFtdhbZb4ceYxiVoqsRh5pbUHiqDiML6HCnh1ec9Jmy0D2xCVduRdxtgx JnnwSr4aics7Qu08GpMkHtHDZ5zePfuGY/y0P5hYKO4RAsfKdXMhJuNT9vkBSFLeUHHg t0ug== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706087985; x=1706692785; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=okamYYJ90zGlQTbnX+bXzskQGwqZG3Khwt1ODbOKY9A=; b=X/Ginajquvlznofx5MVKvYAVSzdfi4ZM2EpAbPfUVAsbGdBB0b2tBu+60Y6matMS6e wUMB2Jx41LmKgPgA4zqtKHYyr+W/Og/Y18QzIfdMB0BSiGTW1I8prGdDa5TJgY2dIsyQ aM5lKcu03vrrdbVL4AFtUXVZZ8rw68CZG5CACcLK+kBA7itz8DPyFsSBHaHnv88wclV9 cBPLaka9ELCyblhZRsgBGgDujlnrU7wCFu58LX6z4xaluWX4QwocNlBvQ9Zf957+PrFS /UWTzS3UTmkY5oi05rBiHij1GH73SzalgCv5RiNSekYyhmaTFhh8HJnm1wucvWu2UVHg Hnwg== X-Gm-Message-State: AOJu0YydBKgVHhfhiZalU9Ap3B7msRXLu1XeMWjbSU/5Hd5xcN+E4vvT w0CNe+uzg+yorJEZHDrhdLq3xSgaXNQ8rvpIWPyW3wRFhTWQdvl/KaCOQoHBQQ== X-Google-Smtp-Source: AGHT+IFZsyHH5UBnQ4AebNcYiJtOgCfA229MN8VIRxKoiEVfHuQbmynAP6Ekj3v7uSaNdTOzYOkm7w== X-Received: by 2002:a05:6402:22f7:b0:559:cf48:630f with SMTP id dn23-20020a05640222f700b00559cf48630fmr784329edb.35.1706087985338; Wed, 24 Jan 2024 01:19:45 -0800 (PST) Received: from [10.137.0.18] (ip-86-49-229-30.bb.vodafone.cz. [86.49.229.30]) by smtp.gmail.com with ESMTPSA id dm24-20020a05640222d800b0055c1433be60sm4102902edb.50.2024.01.24.01.19.44 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 24 Jan 2024 01:19:45 -0800 (PST) Message-ID: <4867452a-b853-4813-a6da-9bb06a336f8b@enterprisedb.com> Date: Wed, 24 Jan 2024 10:19:44 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Melanie Plageman Cc: Robert Haas , Andres Freund , PostgreSQL Hackers , Georgios References: <8ec36f51-b863-60e3-20e2-b9c981c5ce5e@enterprisedb.com> <367160ea-b1ed-4481-e804-bca509128878@enterprisedb.com> <280dc83c-a16f-4424-1319-95e7e3f798bd@enterprisedb.com> <98ba4b25-fae8-c1f4-1597-8093375a1986@enterprisedb.com> <20231221134314.wf2rs62d37u62j7t@alap3.anarazel.de> <20231221154352.ijtg6wloa3nowivh@alap3.anarazel.de> <482ec3ff-52ad-415d-96fd-f3832a894023@enterprisedb.com> <56176b8d-956c-487e-ab09-310db4581c07@enterprisedb.com> Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 1/24/24 01:51, Melanie Plageman wrote: > On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra > wrote: >> >> On 1/19/24 22:43, Melanie Plageman wrote: >> >>> We fill a queue with blocks from TIDs that we fetched from the index. >>> The queue is saved in a scan descriptor that is made available to the >>> streaming read callback. Once the queue is full, we invoke the table >>> AM specific index_fetch_tuple() function which calls >>> pg_streaming_read_buffer_get_next(). When the streaming read API >>> invokes the callback we registered, it simply dequeues a block number >>> for prefetching. >> >> So in a way there are two queues in IndexFetchTableData. One (blk_queue) >> is being filled from IndexNext, and then the queue in StreamingRead. > > I've changed the name from blk_queue to tid_queue to fix the issue you > mention in your later remarks. > I suppose there are two queues. The tid_queue is just to pass the > block requests to the streaming read API. The prefetch distance will > be the smaller of the two sizes. > FWIW I think the two queues are a nice / elegant approach. In hindsight my problems with trying to utilize the StreamingRead were due to trying to use the block-oriented API directly from places that work with TIDs, and this just makes that go away. I wonder what the overhead of shuffling stuff between queues will be, but hopefully not too high (that's my assumption). >>> The only change to the streaming read API is that now, even if the >>> callback returns InvalidBlockNumber, we may not be finished, so make >>> it resumable. >> >> Hmm, not sure when can the callback return InvalidBlockNumber before >> reaching the end. Perhaps for the first index_fetch_heap call? Any >> reason not to fill the blk_queue before calling index_fetch_heap? > > The callback will return InvalidBlockNumber whenever the queue is > empty. Let's say your queue size is 5 and your effective prefetch > distance is 10 (some combination of the PgStreamingReadRange sizes and > PgStreamingRead->max_ios). The first time you call index_fetch_heap(), > the callback returns InvalidBlockNumber. Then the tid_queue is filled > with 5 tids. Then index_fetch_heap() is called. > pg_streaming_read_look_ahead() will prefetch all 5 of these TID's > blocks, emptying the queue. Once all 5 have been dequeued, the > callback will return InvalidBlockNumber. > pg_streaming_read_buffer_get_next() will return one of the 5 blocks in > a buffer and save the associated TID in the per_buffer_data. Before > index_fetch_heap() is called again, we will see that the queue is not > full and fill it up again with 5 TIDs. So, the callback will return > InvalidBlockNumber 3 times in this scenario. > Thanks for the explanation. Yes, I didn't realize that the queues may be of different length, at which point it makes sense to return invalid block to signal the TID queue is empty. >>> Structurally, this changes the timing of when the heap blocks are >>> prefetched. Your code would get a tid from the index and then prefetch >>> the heap block -- doing this until it filled a queue that had the >>> actual tids saved in it. With my approach and the streaming read API, >>> you fetch tids from the index until you've filled up a queue of block >>> numbers. Then the streaming read API will prefetch those heap blocks. >> >> And is that a good/desirable change? I'm not saying it's not, but maybe >> we should not be filling either queue in one go - we don't want to >> overload the prefetching. > > We can focus on the prefetch distance algorithm maintained in the > streaming read API and then make sure that the tid_queue is larger > than the desired prefetch distance maintained by the streaming read > API. > Agreed. I think I wasn't quite right when concerned about "overloading" the prefetch, because that depends entirely on the StreamingRead API queue. A lage TID queue can't cause overload of anything. What could happen is a TID queue being too small, so the prefetch can't hit the target distance. But that can happen already, e.g. indexes that are correlated and/or index-only scans with all-visible pages. >>> There are also table AM layering violations in my sketch which would >>> have to be worked out (not to mention some resource leakage I didn't >>> bother investigating [which causes it to fail tests]). >>> >>> 0001 is all of Thomas' streaming read API code that isn't yet in >>> master and 0002 is my rough sketch of index prefetching using the >>> streaming read API >>> >>> There are also numerous optimizations that your index prefetching >>> patch set does that would need to be added in some way. I haven't >>> thought much about it yet. I wanted to see what you thought of this >>> approach first. Basically, is it workable? >> >> It seems workable, yes. I'm not sure it's much simpler than my patch >> (considering a lot of the code is in the optimizations, which are >> missing from this patch). >> >> I think the question is where should the optimizations happen. I suppose >> some of them might/should happen in the StreamingRead API itself - like >> the detection of sequential patterns, recently prefetched blocks, ... > > So, the streaming read API does detection of sequential patterns and > not prefetching things that are in shared buffers. It doesn't handle > avoiding prefetching recently prefetched blocks yet AFAIK. But I > daresay this would be relevant for other streaming read users and > could certainly be implemented there. > Yes, the "recently prefetched stuff" cache seems like a fairly natural complement to the pattern detection and shared-buffers check. FWIW I wonder if we should make some of this customizable, so that systems with customized storage (e.g. neon or with direct I/O) can e.g. disable some of these checks. Or replace them with their version. >> But I'm not sure what to do about optimizations that are more specific >> to the access path. Consider for example the index-only scans. We don't >> want to prefetch all the pages, we need to inspect the VM and prefetch >> just the not-all-visible ones. And then pass the info to the index scan, >> so that it does not need to check the VM again. It's not clear to me how >> to do this with this approach. > > Yea, this is an issue I'll need to think about. To really spell out > the problem: the callback dequeues a TID from the tid_queue and looks > up its block in the VM. It's all visible. So, it shouldn't return that > block to the streaming read API to fetch from the heap because it > doesn't need to be read. But, where does the callback put the TID so > that the caller can get it? I'm going to think more about this. > Yes, that's the problem for index-only scans. I'd generalize it so that it's about the callback being able to (a) decide if it needs to read the heap page, and (b) store some custom info for the TID. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company