public inbox for [email protected]
help / color / mirror / Atom feedFrom: Konstantin Knizhnik <[email protected]>
To: Tomas Vondra <[email protected]>
To: Peter Geoghegan <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Georgios <[email protected]>
Cc: Dilip Kumar <[email protected]>
Subject: Re: index prefetching
Date: Mon, 29 Dec 2025 18:34:47 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
<4zeu5yb73byiquvf3eefsunnrydyqfxy3eup66jrliutrtd4xl@5iifjey4n5m5>
<[email protected]>
<CAH2-WzkAC_gWQgc3MV3MWBx2EUaE4TfPk8XQPkJgs1R_qNvo=A@mail.gmail.com>
<CAH2-WznLRnUZHC4kmR+J3MgGpN0iTUaZ-4xeSHPiaYR=y3C8Og@mail.gmail.com>
<CAH2-WzkHK0++yn3_nmWL2rWzXiu8Qc=2=NPQDG4nfVjvYkRDrQ@mail.gmail.com>
<CAH2-Wzk9=x=a2TbcqYcX+XXmDHQr5=1v9m4Z_v8a-KwF1Zoz0A@mail.gmail.com>
<CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com>
<CAH2-Wzmm+mXuv_r+eYgcwOKZzNSmbYoEmxi5mdKMUtkmYa_qXw@mail.gmail.com>
<CAH2-WzkC_EdGB_tdEQr63BiOuJbcvZiV=xO+OJYux4wOaxPD8w@mail.gmail.com>
<CAH2-Wzn8whv-RE4E5L2WQB_Ha_jy0UReJq1yypu83gDYHeXq5g@mail.gmail.com>
<CAH2-Wzm7-QuDOs6TcqfhhDsGEZCuHtn=D-SriOTnTZ_fiXNBvA@mail.gmail.com>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
On 29/12/2025 1:53 AM, Tomas Vondra wrote:
>
> On 12/28/25 21:30, Konstantin Knizhnik wrote:
>> On 28/12/2025 8:08 PM, Tomas Vondra wrote:
>>> On 12/25/25 16:39, Konstantin Knizhnik wrote:
>>>> On 21/12/2025 7:55 PM, Peter Geoghegan wrote:
>>>>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan <[email protected]> wrote:
>>>>>> Attached is v4.
>>>>> Attached is v5. Changes from v4:
>>>>>
>>>>> * Simplified and optimized index-only scans, with a particular
>>>>> emphasis on avoiding regressions with nested loop joins with an inner
>>>>> index-only scan.
>>>>>
>>>>> There were quite a number of small problems/dead code related to
>>>>> index-only scans fixed by this new v5. Overall, I'm quite a bit
>>>>> happier with the state of index-only scans, which I'd not paid too
>>>>> much attention to before now.
>>>>>
>>>>> * Added Valgrind instrumentation to the hash index patch, which was
>>>>> required to fix some false positives.
>>>>>
>>>>> The generic indexam_util_batch_unlock routine had Valgrind
>>>>> instrumentation in earlier versions, just to keep nbtree's buffer
>>>>> locking checks from generating similar false positives. Some time
>>>>> later, when I added the hashgetbatch patch, there were new Valgrind
>>>>> false positives during hash index scans -- which I missed at first.
>>>>> This new v5 revisions adds similar Valgrind checks to hash itself
>>>>> (changes that add code that is more or less a direct port of the stuff
>>>>> added to nbtree by commit 4a70f829), which fixes the false positives,
>>>>> and is independently useful.
>>>>>
>>>>> The rule for amgetbatch-based index AMs is that they must have similar
>>>>> buffer locking instrumentation. That seems like a good thing.
>>>>>
>>>>> --
>>>>> Peter Geoghegan
>>>> I the previous mail I shared results of my experiments with different
>>>> prefetch distance.
>>>> I think that we should start prefetching of heap tuples not from the
>>>> second batch, but after some number of proceeded tids.
>>>>
>>>> Attached please find a patch which implements this approach.
>>>> And below are updated results:
>>>>
>>>> limit\prefetch on off always inc threshold
>>>> 1 12074 12765 3146 3282 12394
>>>> 2 5912 6198 2463 2438 6124
>>>> 4 2919 3047 1334 1964 2910
>>>> 8 1554 1496 1166 1409 1588
>>>> 16 815 775 947 940 600
>>>> 32 424 403 687 695 478
>>>> 64 223 208 446 453 358
>>>> 128 115 106 258 270 232
>>>> 256 68 53 138 149 131
>>>> 512 43 27 72 78 71
>>>> 1024 28 13 38 40 38
>>>>
>>>> Last column is result of prefetch with read_stream_threshold=10.
>>>>
>>> That's great, but it only works for cases that can (and do) benefit from
>>> the prefetching. Try running the benchmark with a data set that fits
>>> into shared buffers (or RAM), which makes prefetching useless.
>>>
>>> I tried that with your test, comparing master, v5 and v5 + your
>>> read_stream_threshold patch. See the attached run.sh script, and the PDF
>>> summarizing the results. The last two column groups are comparisons to
>>> master, with green=improvement, red=regression. There are no actual
>>> improvements (1% delta is just noise). But the read_stream_threshold
>>> results have a clear pattern of pretty massive (20-30%) regressions.
>>>
>>> The difference between v5 and v5-threshold is pretty clear.
>>>
>>> IIRC cases like this are *exactly* why we ended up with the current
>>> heuristics, enabling prefetching only from the second batch. This
>>> removes the risk of expensive read_stream init for very fast queries
>>> that don't benefit anything. Of course, prefetching may be useless for
>>> later batches too (e.g. if all the data is cached), but the query will
>>> be expensive enough for the read_stream init cost to be negligible.
>>>
>>> To put this differently, the more aggressive the heuristics is (enabling
>>> prefetching in more case), the more likely it's to cause regressions.
>>> We've chosen to be more defensive, i.e. to sacrifice some possible gains
>>> in order to not regress plausible workloads. I hope we agree queries on
>>> fully cached "hot" data are pretty common / important.
>>>
>>> We can probably do better in the future. But we'll never know for sure
>>> if a given scan benefits from prefetching. It's not just about the
>>> number of items in the batch, but also about how many heap pages that
>>> translates to, what I/O pattern (random vs. sequential?), how many are
>>> already cached. For some queries we don't even know how many items we'll
>>> actually need. We can't check all that at the very beginning, because
>>> it's simply prohibitively expensive.
>>
>> I tried to reproduce your results, but at Mac I do not see some
>> noticeable difference for 250k records, fillfactor=10 and 4GB shared
>> buffers
>> between `enable_indexscan_prefetch=false` and
>> `enable_indexscan_prefetch=true`.
>> I can't believe that just adding this checks in `heap_batch_advance_pos`
>> can cause 75% degrade of performance (because for limit < 10, no read
>> stream is initialized, but still we somewhere loose 25%).
>>
>> I just commented this fragment of code in heapam_handler.c:
>>
>>
>> #if 0
>> proceed_items = ScanDirectionIsForward(direction)
>> ? pos->item - batch->firstItem
>> : batch->lastItem - pos->item;
>> /* Delay initializing stream until proceeding */
>> if (proceed_items >= read_stream_threshold
>> && !scan->xs_heapfetch->rs
>> && !scan->batchqueue->disabled
>> && !scan->xs_want_itup /* XXX prefetching disabled for IoS,
>> for now */
>> && enable_indexscan_prefetch)
>> {
>> scan->xs_heapfetch->rs =
>> read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
>> scan->heapRelation, MAIN_FORKNUM,
>> scan->heapRelation->rd_tableam->index_getnext_stream,
>> scan, 0);
>> }
>> #endif
>>
>> and ... see no difference.
>>
>> I can understand why initializing read stream earlier (not at the second
>> batch, but after 10 proceeded items) may have negative impact on
>> performance when all data is present i shared buffers for LIMIT>=10.
>> But how it can happen with LIMIT 1 and commented fragment above. There
>> is nothing else in my patch except adding GUC.
>> So I think that it is some "external" factor and wonder if you can
>> reproduce this results (just first line).
>>
> It seems this is due to sending an extra SET (for the new GUC) in the
> pgbench script, which is recognized only on the v5+threshold build.
>
> That's a thinko on my side, I should have realized the extra command
> might affect this. It doesn't really affect the behavior, because 10 is
> the default value for read_stream_threshold. I've fixed the script, will
> check fresh results tomorrow.
>
> Still, I think most of what I said about heuristics when to initialize
> the read stream, and the risk/benefit tradeoff, still applies.
>
>
> regards
Attached please find alternative version of the proposed patch which use
number of disk reads as criteria for using read stream.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index b9d42b15a18..125b8addd9b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -36,6 +36,7 @@
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
+#include "optimizer/cost.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
@@ -46,6 +47,8 @@
#include "utils/builtins.h"
#include "utils/rel.h"
+int read_stream_threshold = DEFAULT_READ_STREAM_THRESHOLD;
+
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
Datum *values, bool *isnull, RewriteState rwstate);
@@ -88,6 +91,7 @@ heapam_index_fetch_begin(Relation rel)
hscan->xs_base.rel = rel;
hscan->xs_base.rs = NULL;
+ hscan->xs_base.n_heap_reads = 0;
hscan->xs_cbuf = InvalidBuffer;
hscan->xs_blk = InvalidBlockNumber;
hscan->vmbuf = InvalidBuffer;
@@ -162,7 +166,22 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
if (scan->rs)
hscan->xs_cbuf = read_stream_next_buffer(scan->rs, NULL);
else
- hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+ {
+ ReadBuffersOperation operation;
+ operation.smgr = RelationGetSmgr(hscan->xs_base.rel);
+ operation.rel = hscan->xs_base.rel;
+ operation.persistence = hscan->xs_base.rel->rd_rel->relpersistence;
+ operation.forknum = MAIN_FORKNUM;
+ operation.strategy = NULL;
+ if (StartReadBuffer(&operation,
+ &hscan->xs_cbuf,
+ hscan->xs_blk,
+ READ_BUFFERS_SYNCHRONOUSLY))
+ {
+ WaitReadBuffers(&operation);
+ scan->n_heap_reads += 1;
+ }
+ }
/*
* Prune page when it is pinned for the first time
@@ -288,6 +307,20 @@ heap_batch_advance_pos(IndexScanDesc scan, struct BatchQueueItemPos *pos,
*/
batch = INDEX_SCAN_BATCH(scan, pos->batch);
+ /* Delay initializing stream until proceeding */
+ if (!scan->xs_heapfetch->rs
+ && !scan->batchqueue->disabled
+ && !scan->xs_want_itup /* XXX prefetching disabled for IoS, for now */
+ && enable_indexscan_prefetch
+ && scan->xs_heapfetch->n_heap_reads >= (uint64)read_stream_threshold) /* -1 -> +inf */
+ {
+ scan->xs_heapfetch->rs =
+ read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
+ scan->heapRelation, MAIN_FORKNUM,
+ scan->heapRelation->rd_tableam->index_getnext_stream,
+ scan, 0);
+ }
+
if (ScanDirectionIsForward(direction))
{
if (++pos->item <= batch->lastItem)
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
index 29207276dca..d2fa5519378 100644
--- a/src/backend/access/index/indexbatch.c
+++ b/src/backend/access/index/indexbatch.c
@@ -169,17 +169,6 @@ batch_getnext(IndexScanDesc scan, ScanDirection direction)
DEBUG_LOG("batch_getnext headBatch %d nextBatch %d batch %p",
batchqueue->headBatch, batchqueue->nextBatch, batch);
-
- /* Delay initializing stream until reading from scan's second batch */
- if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
- !scan->xs_want_itup && /* XXX prefetching disabled for IoS, for
- * now */
- enable_indexscan_prefetch)
- scan->xs_heapfetch->rs =
- read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
- scan->heapRelation, MAIN_FORKNUM,
- scan->heapRelation->rd_tableam->index_getnext_stream,
- scan, 0);
}
else
scan->finished = true;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 8f6fa6843cb..0c0819e4d13 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2322,6 +2322,16 @@
max => 'DBL_MAX',
},
+{ name => 'read_stream_threshold', type => 'int', context => 'PGC_USERSET', group => 'QUERY_TUNING_COST',
+ short_desc => 'Minimal number of heap reads during index scan for creation of read stream',
+ long_desc => 'Index scan needs to read heap to check visibility of tuples and get attributes not present in index key. Read stream allows to do it asynchronously which adds extra overhead, but allows to significantly increase speed for long scans. Specify -1 to disable.',
+ flags => 'GUC_EXPLAIN',
+ variable => 'read_stream_threshold',
+ boot_val => 'DEFAULT_READ_STREAM_THRESHOLD',
+ min => '-1',
+ max => 'INT_MAX',
+},
+
{ name => 'recovery_end_command', type => 'string', context => 'PGC_SIGHUP', group => 'WAL_ARCHIVE_RECOVERY',
short_desc => 'Sets the shell command that will be executed once at the end of recovery.',
variable => 'recoveryEndCommand',
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 1157ba9ba9d..9690d89ad8a 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -126,6 +126,7 @@ typedef struct IndexFetchTableData
{
Relation rel;
ReadStream *rs;
+ uint64 n_heap_reads; /* number of heap page read from the disk */
} IndexFetchTableData;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 00f4c3d0011..97150433c99 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -28,6 +28,7 @@
#define DEFAULT_CPU_OPERATOR_COST 0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST 1000.0
+#define DEFAULT_READ_STREAM_THRESHOLD 10
/* defaults for non-Cost parameters */
#define DEFAULT_RECURSIVE_WORKTABLE_FACTOR 10.0
@@ -72,6 +73,7 @@ extern PGDLLIMPORT bool enable_partition_pruning;
extern PGDLLIMPORT bool enable_presorted_aggregate;
extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
+extern PGDLLIMPORT int read_stream_threshold;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
double index_pages, PlannerInfo *root);
Attachments:
[text/plain] read_stream_threshold_v2.patch (5.8K, 2-read_stream_threshold_v2.patch)
download | inline diff:
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index b9d42b15a18..125b8addd9b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -36,6 +36,7 @@
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
+#include "optimizer/cost.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
@@ -46,6 +47,8 @@
#include "utils/builtins.h"
#include "utils/rel.h"
+int read_stream_threshold = DEFAULT_READ_STREAM_THRESHOLD;
+
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
Datum *values, bool *isnull, RewriteState rwstate);
@@ -88,6 +91,7 @@ heapam_index_fetch_begin(Relation rel)
hscan->xs_base.rel = rel;
hscan->xs_base.rs = NULL;
+ hscan->xs_base.n_heap_reads = 0;
hscan->xs_cbuf = InvalidBuffer;
hscan->xs_blk = InvalidBlockNumber;
hscan->vmbuf = InvalidBuffer;
@@ -162,7 +166,22 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
if (scan->rs)
hscan->xs_cbuf = read_stream_next_buffer(scan->rs, NULL);
else
- hscan->xs_cbuf = ReadBuffer(hscan->xs_base.rel, hscan->xs_blk);
+ {
+ ReadBuffersOperation operation;
+ operation.smgr = RelationGetSmgr(hscan->xs_base.rel);
+ operation.rel = hscan->xs_base.rel;
+ operation.persistence = hscan->xs_base.rel->rd_rel->relpersistence;
+ operation.forknum = MAIN_FORKNUM;
+ operation.strategy = NULL;
+ if (StartReadBuffer(&operation,
+ &hscan->xs_cbuf,
+ hscan->xs_blk,
+ READ_BUFFERS_SYNCHRONOUSLY))
+ {
+ WaitReadBuffers(&operation);
+ scan->n_heap_reads += 1;
+ }
+ }
/*
* Prune page when it is pinned for the first time
@@ -288,6 +307,20 @@ heap_batch_advance_pos(IndexScanDesc scan, struct BatchQueueItemPos *pos,
*/
batch = INDEX_SCAN_BATCH(scan, pos->batch);
+ /* Delay initializing stream until proceeding */
+ if (!scan->xs_heapfetch->rs
+ && !scan->batchqueue->disabled
+ && !scan->xs_want_itup /* XXX prefetching disabled for IoS, for now */
+ && enable_indexscan_prefetch
+ && scan->xs_heapfetch->n_heap_reads >= (uint64)read_stream_threshold) /* -1 -> +inf */
+ {
+ scan->xs_heapfetch->rs =
+ read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
+ scan->heapRelation, MAIN_FORKNUM,
+ scan->heapRelation->rd_tableam->index_getnext_stream,
+ scan, 0);
+ }
+
if (ScanDirectionIsForward(direction))
{
if (++pos->item <= batch->lastItem)
diff --git a/src/backend/access/index/indexbatch.c b/src/backend/access/index/indexbatch.c
index 29207276dca..d2fa5519378 100644
--- a/src/backend/access/index/indexbatch.c
+++ b/src/backend/access/index/indexbatch.c
@@ -169,17 +169,6 @@ batch_getnext(IndexScanDesc scan, ScanDirection direction)
DEBUG_LOG("batch_getnext headBatch %d nextBatch %d batch %p",
batchqueue->headBatch, batchqueue->nextBatch, batch);
-
- /* Delay initializing stream until reading from scan's second batch */
- if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
- !scan->xs_want_itup && /* XXX prefetching disabled for IoS, for
- * now */
- enable_indexscan_prefetch)
- scan->xs_heapfetch->rs =
- read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
- scan->heapRelation, MAIN_FORKNUM,
- scan->heapRelation->rd_tableam->index_getnext_stream,
- scan, 0);
}
else
scan->finished = true;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 8f6fa6843cb..0c0819e4d13 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2322,6 +2322,16 @@
max => 'DBL_MAX',
},
+{ name => 'read_stream_threshold', type => 'int', context => 'PGC_USERSET', group => 'QUERY_TUNING_COST',
+ short_desc => 'Minimal number of heap reads during index scan for creation of read stream',
+ long_desc => 'Index scan needs to read heap to check visibility of tuples and get attributes not present in index key. Read stream allows to do it asynchronously which adds extra overhead, but allows to significantly increase speed for long scans. Specify -1 to disable.',
+ flags => 'GUC_EXPLAIN',
+ variable => 'read_stream_threshold',
+ boot_val => 'DEFAULT_READ_STREAM_THRESHOLD',
+ min => '-1',
+ max => 'INT_MAX',
+},
+
{ name => 'recovery_end_command', type => 'string', context => 'PGC_SIGHUP', group => 'WAL_ARCHIVE_RECOVERY',
short_desc => 'Sets the shell command that will be executed once at the end of recovery.',
variable => 'recoveryEndCommand',
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 1157ba9ba9d..9690d89ad8a 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -126,6 +126,7 @@ typedef struct IndexFetchTableData
{
Relation rel;
ReadStream *rs;
+ uint64 n_heap_reads; /* number of heap page read from the disk */
} IndexFetchTableData;
/*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 00f4c3d0011..97150433c99 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -28,6 +28,7 @@
#define DEFAULT_CPU_OPERATOR_COST 0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST 1000.0
+#define DEFAULT_READ_STREAM_THRESHOLD 10
/* defaults for non-Cost parameters */
#define DEFAULT_RECURSIVE_WORKTABLE_FACTOR 10.0
@@ -72,6 +73,7 @@ extern PGDLLIMPORT bool enable_partition_pruning;
extern PGDLLIMPORT bool enable_presorted_aggregate;
extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
+extern PGDLLIMPORT int read_stream_threshold;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
double index_pages, PlannerInfo *root);
view thread (348+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: index prefetching
In-Reply-To: <[email protected]>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox