Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w2TgW-000KBp-1j for pgsql-hackers@arkaria.postgresql.org; Tue, 17 Mar 2026 12:38:08 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w2TgU-001E2d-0f for pgsql-hackers@arkaria.postgresql.org; Tue, 17 Mar 2026 12:38:06 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w2TgT-001E2U-1f for pgsql-hackers@lists.postgresql.org; Tue, 17 Mar 2026 12:38:05 +0000 Received: from mail-ej1-x630.google.com ([2a00:1450:4864:20::630]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w2TgN-00000000BTM-2sJK for pgsql-hackers@postgresql.org; Tue, 17 Mar 2026 12:38:04 +0000 Received: by mail-ej1-x630.google.com with SMTP id a640c23a62f3a-b97ed4ad579so16556666b.3 for ; Tue, 17 Mar 2026 05:38:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1773751079; cv=none; d=google.com; s=arc-20240605; b=kKMJ0V24tZNSaNrqV+liapZeq56OZE2ozx3boY0osfQodCH5dWgvfBVq58BYvsGcoC VjvK8lfo08AvwgIICM8SmjCJjZpNgsiH9S876XRx+J1o2t7eT0uiLTA9LQSUQqp6qLwh pzFbq4hsAiN8ujSQUdMMtHsZb01baXwkVbd4NX0wQAmzo0vbG4SWpPwmX8yqgVmLLky4 1coRCFmRK0qLI0lcvk99zBxXWORVVMR4ut02y+gMN1dne9B1kpFGBJjURMAixWNS6brH 9sM3UZWIm7j48WN+1q5wdsxxpOZj3huBCrI4zp3lKI5zld6SREDHuLDQUGjAQg2Pg3hH 8r5A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=pS6TJIKRfXafdlA4OWP4PqcFfLkKQDowXk1ud0eJ54M=; fh=08EifIfup7JDgtEHVehZzlm/MqocH8XVSs3DoSizXhk=; b=jyEF4lxXbGGFLtMag+1tgg3GXzmUDUPwd8wKmrHNy1GGjTmfI6G5ej7jPJ82NVHpKJ CQnL2kO/G/rB91K42dsvuGZ2B2ypcEwxYpKxfqv4S8YP3bmo+dLUfs4dAM9NtVCjCm/j iTqWbCYgKZXplrlXx3kta8wMJP6q5qjKoNz2Em6Xw+7VOw/FjClIbdsoBKyw04aj5d4U KYNpg7D+CcENHLyWVR+Bvt6zMkm7UaLpKLNICweszLuvtLiAZXpStU39X4x2kx6FiIUF 21K+zutDrUyeaFAYbDukqBztdI03J3650M7c4ljtZxGQkB2uNZ/hlVLrm2dwad+c0nsM E9Lw==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1773751079; x=1774355879; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=pS6TJIKRfXafdlA4OWP4PqcFfLkKQDowXk1ud0eJ54M=; b=FdHoTAy2z1WbbnXDjhx3UcG5IXe07SejdaXwDThYVgni8UTd495PbBh73VAN3PRyJv DL2bvEfELBIBUXvZJQ9ws5fGr/mIup7yuzhbyxxnBr6HE0owqVJXf2ihVdENV8UK3J0u h2xp9BQUpkQsV7h7MBsP9d061R0a/55RBDI3LrmJFmoX7qPDQJYU65xDpEgWFH+Nd7tQ 7L1VxyPoH30IOZVvH1m98bhTjthhzeWK2B7bctOd8esN8+sVouC9avQIZYQh6YKpYhYI cfLlfJ7aEoAEKTNVhVA7reN0pZbG1+MbkOmGDq6GFPBfJeAVok8bveNFJ0O2eaiML4BJ E48w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1773751079; x=1774355879; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=pS6TJIKRfXafdlA4OWP4PqcFfLkKQDowXk1ud0eJ54M=; b=aoIC9pOypJcq+g1wnK9L3m4xuJ/WOqXhSUD9uLL1m7ivyhC5szoETkhdDGwyrZTNqe by8eig2N3ovRRxpc/sEeHoo7HVP/s9v2hFfYnfoDb9CwfJRxnwBz4m4ZO8oZ4tN+6036 O4j2JCBZnQyio4R0/KmXed0fVuN9ZD3svY7I7yphd6jZLd/MXlFQ+4inIfTpwEN4OPFC mJ1ed8OSPu6rpUzHlBk31pKyr6TGXD0i05uwPqktYLa1pJ82qnO2fJGEcHDoXmhNIZ5e Gpe+oOXiDZvU8muQjiaRdEdHPbBHLAo0UZk4OKgpp3Umb8rGNBwJpLfk/+xoPMZi7hck yH9w== X-Gm-Message-State: AOJu0Yz11X30SYoBujR9IprVdDMMKthf5vV1UMrymh6mF3ySRkWS3TV5 xGq/fqme5hC7JAD9wPaG/xZWLxXkVBKgFqPlxrBZUXWEn49YE7EuuYF1JiIY87VBAbuEnptLvIa H2GU9MJOrd10C6RN0YuHrfH23mo/bFK5oHMa8 X-Gm-Gg: ATEYQzxph/BqKSpvFhwDKwggcbG9mpupZ6OtydcupO6h+mxiOIN0/BRQ7JRFEmQQhhU acHESeVTepmTmfn/IPdpTPf/19aXm2f/YhiMsOdJ+lbD6XRZSGVILj9rlejU100ag7phHmrPtty w7B4u0NOSR9g5LTwJFPb8SANhQIsbfK9IhOrRm90EAbH0lMSk52rmjLx77qOakRupUlllJlIkwN 4+u1RIZUXcfQZpRnBUN4Oi9qCvrk/hlCyt4ZJHwiQ59X46NZ12YnA+xD/r+ogznkGnEb1eTWaIy I6uQRtuH9dv0qv1RtFCgEFb/zrBVTUcnbF9XoM5O4nslaQboWC8m/YHBrBAm X-Received: by 2002:a17:907:3cc5:b0:b87:2471:def3 with SMTP id a640c23a62f3a-b97651e7e10mr942191366b.55.1773751078753; Tue, 17 Mar 2026 05:37:58 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Alexandre Felipe Date: Tue, 17 Mar 2026 12:37:47 +0000 X-Gm-Features: AaiRm51MV4yLaJxwVCQdmxjH-ubU1edQfPAVFJPRwnL-08SfLPlkQxTl48Qsrkg Message-ID: Subject: Re: New access method for b-tree. To: "pgsql-hackers@postgresql.org" , michael@paquier.xyz, tgl@sss.pgh.pa.us, "peter@eisentraut.org" Cc: Ants Aasma , Tomas Vondra , Alexandre Felipe , =?UTF-8?B?TWljaGHFgiBLxYJlY3plaw==?= , Andres Freund Content-Type: multipart/alternative; boundary="000000000000f826dc064d379ae2" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000f826dc064d379ae2 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Happy St. Patrick's day! Based on what I said said in previous emails I see alternative proposals #1 Make it simpler by not changing the index access methods. #2 Make it optimal in some sense by not using generic index searches and not keeping multiple open index scans. and #3 Follow the pragmatic approach Objective is, minimize the number of heap fetches. As high level as possible, reusing existing functions instead of writing custom code when possible. Ants Aasma & Tomas Vondra > > My workarounds I have proposed users have been either to rewrite the > > query as a UNION ALL of a set of single value prefix queries wrapped > > in an order by limit. This gives the exact needed merge append plan > > shape. But repeating the query N times can get unwieldy when the > > number of values grows, so the fallback is: > > > > SELECT * FROM unnest(:friends) id, LATERAL ( > > SELECT * FROM posts > > WHERE user_id =3D id > > ORDER BY tstamp DESC LIMIT 100) > > ORDER BY tstamp DESC LIMIT 100; > > > > The downside of this formulation is that we still have to fetch a > > batch worth of items from scans where we otherwise would have only had > > to look at one index tuple. > > > True. It's useful to think about the query this way, and it may be > better than full select + sort, but it has issues too. > An issue with this query is generality, if this is joined with other queries we can't determine in advance the limit. > The main problem I can see is that at planning time the cardinality of > > the prefix array might not be known, and in theory could be in the > > millions. Having millions of index scans open at the same time is not > > viable, so the method needs to somehow degrade gracefully. The idea I > > had is to pick some limit, based on work_mem and/or benchmarking, and > > one the limit is hit, populate the first batch and then run the next > > batch of index scans, merging with the first result. Or something like > > that, I can imagine a few different ways to handle it with different > > tradeoffs. > > > > Doesn't the proposed merge scan have a similar issue? Because that will > also have to keep all the index scans open (even if only internally). > Indeed, it needs to degrade gracefully, in some way. It is true, but I think we can trust the planner. This problem scales similarly in a memoize node. Is ~24kB for each open index scan a good guess? ALTERNATIVE #1 - More efficient Or to avoid having N open index scans we could (??) (1) find the index page for the head of each prefix. (2) for each prefix (2.a) load tuples from each head page (2.b) if we consume the last tuple in a page save a pointer to the next page. (2.c) check if tuples for the next prefix are in the same page (2.d) Release the page. (3) producing tuples in the suffix order (3.b) when tuples for prefix are exhausted load load page from (2.b) Step 2.a. would possibly waste time extracting tuples that are not needed, and memory by storing them. Not sure how efficient this can be compared to having an open index scan. Matthias van de Meent, Feb 3 > btree index skip scan infrastructure efficiently prevents new index > descents into the index when the selected SAOP key ranges are directly > adjecent, while merge scan would generally do at least one index > descent for each of its N scan heads (*) - which in the proposed > prototype patch guarantees O(index depth * num scan heads) buffer > accesses. This could also be addressed if we do this custom descent, I didn't bother about that depth factor because with a few random prefixes doing so we are probably going to save accesses only for the top level. I would prefer to start with a very conceptual implementation that can already provide 1000x speedup, but if you think this way is better, I am open to try it. I think this can be done without affecting the planner logic and the PrefixJoin node. I'm afraid the > proposed batches execution will be rather complex, so I'd say v1 should > simply have a threshold, and do the full scan + sort for more items. Do you mean by an executor node that performs the query as if it was writte= n ALTERNATIVE #2 - Simpler(??) for each _prefix of prefixes: result +=3D (SELECT FROM table WHERE prefix =3D _prefix AND qual(*) ORDER BY suffix LIMIT N) return SELECT * FROM result ORDER BY suffix LIMIT N This query may have to produce N * len(prefixes) rows, while the original proposal would produce only N + len(prefixes) - 1. Comparing to the previous results, > | Method | Shared Hit | Shared Read | Exec Time | > |------------|-----------:|------------:|----------:| > | Merge | 13 | 119 | 13 ms | > | IndexScan | 15,308 | 525,310 | 3,409 ms | This Prefix Batch Scan approach hit=3D62 read=3D773, Execution Time: 80.815 ms With 8 prefixes, the execution time increased to 80.8/13 ~ 6.21 And the number of buffers by 835/132 ~ 6.32 > I can imagine that this would really nicely benefit from > ReadStream'ification. > > > > Not sure, maybe. > Actually as I was watching the index prefetch development I was quite uncertain about how this would play with that, but we can probably simply give a budget for each stream. > One other connection I see is with block nested loops. In a perfect > > future PostgreSQL could run the following as a set of merged index > > scans that terminate early: > > > > SELECT posts.* > > FROM follows f > > JOIN posts p ON f.followed_id =3D p.user_id > > WHERE f.follower_id =3D :userid > > ORDER BY p.tstamp DESC LIMIT 100; > > > > In practice this is not a huge issue - it's not that hard to transform > > this to array_agg and =3D ANY subqueries. > > Automating that transformation seems quite non-trivial (to me). > Well, not trivial. To give a rough idea. wc -l *.patch 113 v2-0001-Test-the-baseline.patch 614 v2-0002-Access-method.patch 850 v2-0003-Planner-integration.patch 1958 v2-0004-Multi-column.patch 2439 v2-0005-Joins.patch it is missing some important details like prefix deduplication but for the scenario where the values on the other table are known to be unique it is good. The multi column accepts things like A in (...) B in (...) and computes the cartesian product or (A, B) IN (...) Regards, Alexandre On Mon, Feb 23, 2026 at 10:08=E2=80=AFPM Alexandre Felipe < o.alexandre.felipe@gmail.com> wrote: > Hi Hackers, > > Do you think that MERGE-SCAN was a terrible name? I wanted a name that > wouldn't > > require much explanation. I named it like this because it relies on a k-w= ay > > > merge to combine several segments of an index in one result. But we alrea= dy > > > have a MERGE statement. Even in the example plan above we can see > an external > > merge that has nothing to do with the new feature, and now as I am doing > joins, > > I started doing it on the NestedLoop trying to follow the same conditions > that > > lead to a memoize. But I added so many fields to the NestedLoop state tha= t > I > > think it is good to have a separate structure, and maybe a separate node, > and > > MergeScan of course is taken hehe. I was thinking of IndexPrefixMerge. We > could > > use the Ants nickname TimeLineScan, but of course it is not limited to ti= me > > > lines (even though realistically, that will probably be the most common > use of > > this). Another one I considered was TransposedIndexScan, because it order= s > > > output on (suffix, prefix) instead of (prefix, suffix). > > > > > On Fri, Feb 6, 2026 at 10:52=E2=80=AFAM Alexandre Felipe < > o.alexandre.felipe@gmail.com> wrote: > >> Hello again hackers! >> >> +pt@bowt.ie : That seems to be the one that is probably the >> most familiar with the index scan (based on the commits). >> +michael@paquier.xyz , +tgl@sss.pgh.pa.us >> , +peter@eisentraut.org as >> the top 3 committers to nbtree over the last ~6 months. >> >> I have made substantial progress on adding a few features. I have >> questions, but I will let you go first :) >> >> Motivation: >> *In technical terms:* this proposal is to take advantage of a btree >> index when the query is filtered by a few distinct prefixes and ordered = by >> a suffix and has a limit. >> *In non technical:* This could help to efficiently render a social >> network feed, where each user can select a list of users whose posts the= y >> want to see, and the posts must be ordered from newest to oldest. >> >> >> *Performance Comparison* >> I did a test with a toy table, please find more details below. >> >> With limit 100 >> >> | Method | Shared Hit | Shared Read | Exec Time | >> |------------|-----------:|------------:|----------:| >> | Merge | 13 | 119 | 13 ms | >> | IndexScan | 15,308 | 525,310 | 3,409 ms | >> >> With limit 1,000,000 >> >> | Method | SharedHit | SharRead | Temp I | Temp O | Exec Time | >> |------------|-----------:|---------:|-------:|-------:|----------:| >> | Merge | 980,318 | 19,721 | 0 | 0 | 2,128 ms | >> | Sequential | 15,208 | 525,410 | 20,207 | 35,384 | 3,762 ms | >> | Bitmap | 629 | 113,759 | 20,207 | 35,385 | 5,487 ms | >> | IndexScan | 7,880,619 | 126,706 | 20,945 | 35,386 | 5,874 ms | >> >> Sequential scans and bitmap scans in this case reduces significantly the >> number of >> accessed buff because the table has only four integer columns, and these >> methods >> can read all the lines on a given page at a time. >> >> However that comes at the cost of resorting to an in-disk sort method. >> For the query with limit 100 we get no temp files as we are using a >> top-100 sort. >> >> make check passes >> >> >> *Experiment details* >> >> Consider a 100M row table formed (a,b,c,d) \in 100 x 100 x 100 x 100 >> >> >> ```sql >> CREATE TABLE grid AS ( >> SELECT a, b, c, d, FROM >> generate_series(1, 100) AS a, >> generate_series(1, 100) AS b, >> generate_series(1, 100) AS c, >> generate_series(1, 100) AS d >> ); >> >> CREATE INDEX grid_index ON grid (a, b, c); >> ANALYSE grid; >> ``` >> >> Now let's say that we need to find certain number of rows filtered by a >> and ordered by b; >> ```sql >> PREPARE grid_query(int) AS >> SELECT sum(d) FROM ( >> SELECT * FROM grid >> WHERE a IN (2,3,5,8,13,21,34,55) AND b >=3D 0 >> ORDER BY b >> LIMIT $1) t; >> ``` >> >> --- >> >> >> Now with limit 100, with index merge scan (notice Index Prefixes in the >> plan). >> >> ```sql >> SET enable_indexmergescan =3D on; >> EXPLAIN (ANALYSE) EXECUTE grid_query(100); >> ``` >> >> ```text >> Buffers: shared hit=3D13 read=3D119 >> -> Limit (cost=3D0.57..87.29 rows=3D100 width=3D16) (actual >> time=3D5.528..12.999 rows=3D100.00 loops=3D1) >> Buffers: shared hit=3D13 read=3D119 >> -> Index Scan using grid_a_b_c_idx on grid (cost=3D0.57..93.3= 6 >> rows=3D107 width=3D16) (actual time=3D5.528..12.994 rows=3D100.00 loops= =3D1) >> Index Cond: (b >=3D 0) >> *Index Prefixes: *(a =3D ANY >> ('{2,3,5,8,13,21,34,55}'::integer[])) >> Index Searches: 8 >> Buffers: shared hit=3D13 read=3D119 >> Planning: >> Buffers: shared hit=3D59 read=3D23 >> Planning Time: 4.619 ms >> Execution Time: 13.055 ms >> ``` >> >> >> ```sql >> SET enable_indexmergescan =3D off; >> EXPLAIN (ANALYSE) EXECUTE grid_query(100); >> ``` >> >> ```text >> Aggregate (cost=3D1603588.06..1603588.07 rows=3D1 width=3D8) (actual >> time=3D3406.624..3408.710 rows=3D1.00 loops=3D1) >> Buffers: shared hit=3D15308 read=3D525310 >> -> Limit (cost=3D1603575.17..1603586.81 rows=3D100 width=3D16) (act= ual >> time=3D3406.601..3408.702 rows=3D100.00 loops=3D1) >> Buffers: shared hit=3D15308 read=3D525310 >> -> Gather Merge (cost=3D1603575.17..2514342.92 rows=3D7819999 >> width=3D16) (actual time=3D3406.598..3408.695 rows=3D100.00 loops=3D1) >> Workers Planned: 2 >> Workers Launched: 2 >> Buffers: shared hit=3D15308 read=3D525310 >> -> Sort (cost=3D1602575.14..1610720.98 rows=3D3258333 >> width=3D16) (actual time=3D3393.782..3393.784 rows=3D100.00 loops=3D3) >> Sort Key: grid.b >> Sort Method: top-N heapsort Memory: 32kB >> Buffers: shared hit=3D15308 read=3D525310 >> Worker 0: Sort Method: top-N heapsort Memory: 32k= B >> Worker 1: Sort Method: top-N heapsort Memory: 32k= B >> -> *Parallel Seq Scan* on grid >> (cost=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D0.94= 4..3129.896 >> rows=3D2666666.67 loops=3D3) >> Filter: ((b >=3D 0) AND (a =3D ANY >> ('{2,3,5,8,13,21,34,55}'::integer[]))) >> Rows Removed by Filter: 30666667 >> Buffers: shared hit=3D15234 read=3D525310 >> Planning Time: 0.370 ms >> Execution Time: 3409.134 ms >> ``` >> >> Now queries with limit 1,000,000 >> >> ```sql >> SET enable_indexmergescan =3D on; >> EXPLAIN ANALYSE EXECUTE grid_query(1000000); >> ``` >> >> Query executed with the proposed access method. Notice in the plan Index >> Prefixes and Index Cond. >> ```text >> Buffers: shared hit=3D980318 read=3D19721 >> -> Limit (cost=3D0.57..867259.84 rows=3D1000000 width=3D16) (actual >> time=3D2.854..2103.438 rows=3D1000000.00 loops=3D1) >> Buffers: shared hit=3D980318 read=3D19721 >> -> Index Scan using grid_a_b_c_idx on grid >> (cost=3D0.57..867265.91 rows=3D1000007 width=3D16) (actual time=3D2.852= ..2066.205 >> rows=3D1000000.00 loops=3D1) >> Index Cond: (b >=3D 0) >> *Index Prefixes:* (a =3D ANY >> ('{2,3,5,8,13,21,34,55}'::integer[])) >> Index Searches: 8 >> Buffers: shared hit=3D980318 read=3D19721 >> Planning Time: 0.328 ms >> Execution Time: 2127.811 ms >> ``` >> >> If we disable index_mergescan we naturally we fall into a sequential sca= n. >> >> ```sql >> SET enable_indexmergescan =3D off; >> EXPLAIN ANALYSE EXECUTE grid_query(1000000); >> ``` >> ```text >> Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 written= =3D35384 >> -> Limit (cost=3D1942895.64..2059362.12 rows=3D1000000 width=3D16) = (actual >> time=3D3467.012..3712.044 rows=3D1000000.00 loops=3D1) >> Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 >> written=3D35384 >> -> Gather Merge (cost=3D1942895.64..2853663.39 rows=3D7819999 >> width=3D16) (actual time=3D3467.010..3671.220 rows=3D1000000.00 loops=3D= 1) >> Workers Planned: 2 >> Workers Launched: 2 >> Buffers: shared hit=3D15208 read=3D525410, temp read=3D20= 207 >> written=3D35384 >> -> Sort (cost=3D1941895.62..1950041.45 rows=3D3258333 >> width=3D16) (actual time=3D3455.852..3476.358 rows=3D334576.33 loops=3D3= ) >> Sort Key: grid.b >> Sort Method: *external merge Disk: 47016kB* >> Buffers: shared hit=3D15208 read=3D525410, temp >> read=3D20207 written=3D35384 >> Worker 0: Sort Method: external merge Disk: 46976= kB >> Worker 1: Sort Method: external merge Disk: 47000= kB >> -> *Parallel Seq Scan* on grid >> (cost=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D2.78= 9..2779.483 >> rows=3D2666666.67 loops=3D3) >> Filter: ((b >=3D 0) AND (a =3D ANY >> ('{2,3,5,8,13,21,34,55}'::integer[]))) >> Rows Removed by Filter: 30666667 >> Buffers: shared hit=3D15134 read=3D525410 >> Planning Time: 0.332 ms >> Execution Time: 3761.866 ms >> ``` >> >> If we disable sequential scans, then we get a bitmap scan >> >> ```sql >> SET enable_seqscan =3D off; >> EXPLAIN ANALYSE EXECUTE grid_query(1000000); >> ``` >> ```text >> Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=3D2020= 7 >> written=3D35385 >> -> Limit (cost=3D1998199.78..2114666.26 rows=3D1000000 width=3D16) = (actual >> time=3D5170.456..5453.433 rows=3D1000000.00 loops=3D1) >> Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read= =3D20207 >> written=3D35385 >> -> Gather Merge (cost=3D1998199.78..2908967.53 rows=3D7819999 >> width=3D16) (actual time=3D5170.455..5413.254 rows=3D1000000.00 loops=3D= 1) >> Workers Planned: 2 >> Workers Launched: 2 >> Buffers: shared hit=3D629 read=3D113759 written=3D2, temp >> read=3D20207 written=3D35385 >> -> Sort (cost=3D1997199.75..2005345.59 rows=3D3258333 >> width=3D16) (actual time=3D5156.929..5177.507 rows=3D334500.67 loops=3D3= ) >> Sort Key: grid.b >> Sort Method: external merge Disk: 47032kB >> Buffers: shared hit=3D629 read=3D113759 written=3D2= , temp >> read=3D20207 written=3D35385 >> Worker 0: Sort Method: external merge Disk: 47280= kB >> Worker 1: Sort Method: external merge Disk: 46680= kB >> -> Parallel Bitmap Heap Scan on grid >> (cost=3D107691.54..1533348.13 rows=3D3258333 width=3D16) (actual >> time=3D299.891..4489.787 rows=3D2666666.67 loops=3D3) >> Recheck Cond: ((a =3D ANY >> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0)) >> Rows *Removed by Index Recheck*: 2410242 >> Heap Blocks: exact=3D13100 lossy=3D22639 >> Buffers: shared hit=3D615 read=3D113759 writt= en=3D2 >> Worker 0: Heap Blocks: exact=3D13077 lossy= =3D22755 >> Worker 1: Heap Blocks: exact=3D13036 lossy= =3D22421 >> -> *Bitmap Index Scan* on grid_a_b_c_idx >> (cost=3D0.00..105736.54 rows=3D7820000 width=3D0) (actual time=3D297.65= 1..297.651 >> rows=3D8000000.00 loops=3D1) >> Index Cond: ((a =3D ANY >> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0)) >> Index Searches: 7 >> Buffers: shared hit=3D13 read=3D7293 >> written=3D2 >> Planning Time: 0.165 ms >> Execution Time: 5487.213 ms >> ``` >> >> If we disable bitmap scans we finally get an index scan >> >> ```sql >> SET enable_bitmapscan =3D off; >> EXPLAIN ANALYSE EXECUTE grid_query(1000000); >> ``` >> ``` >> Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699 writte= n=3D35385 >> -> Limit (cost=3D7201203.08..7317669.55 rows=3D1000000 width=3D16) = (actual >> time=3D4414.478..4674.400 rows=3D1000000.00 loops=3D1) >> Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699 >> written=3D35385 >> -> Gather Merge (cost=3D7201203.08..8111970.83 rows=3D7819999 >> width=3D16) (actual time=3D4414.476..4633.982 rows=3D1000000.00 loops=3D= 1) >> Workers Planned: 2 >> Workers Launched: 2 >> Buffers: shared hit=3D7883221 read=3D124111, temp read=3D= 20699 >> written=3D35385 >> -> Sort (cost=3D7200203.05..7208348.88 rows=3D3258333 >> width=3D16) (actual time=3D4390.625..4411.896 rows=3D334567.00 loops=3D3= ) >> Sort Key: grid.b >> Sort Method: *external merge Disk: 47304kB* >> Buffers: shared hit=3D7883221 read=3D124111, temp >> read=3D20699 written=3D35385 >> Worker 0: Sort Method: external merge Disk: 47304= kB >> Worker 1: Sort Method: external merge Disk: 46384= kB >> -> *Parallel Index Scan* using grid_a_b_c_idx on >> grid (cost=3D0.57..6736351.43 rows=3D3258333 width=3D16) (actual >> time=3D46.925..3796.915 rows=3D2666666.67 loops=3D3) >> Index Cond: ((a =3D ANY >> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0)) >> Index Searches: 7 >> Buffers: shared hit=3D7883208 read=3D124110 >> Planning Time: 0.385 ms >> Execution Time: 4713.325 ms >> ``` >> >> >> >> >> >> >> On Thu, Feb 5, 2026 at 6:59=E2=80=AFAM Alexandre Felipe < >> o.alexandre.felipe@gmail.com> wrote: >> >>> Thank you for looking into this. >>> >>> Now we can execute a, still narrow, family queries! >>> >>> Maybe it helps to see this as a *social network feeds*. Imagine a >>> social network, you have a few friends, or follow a few people, and you >>> want to see their updates ordered by date. For each user we have a >>> different combination of users that we have to display. But maybe, even >>> having hundreds of users we will only show the first 10. >>> >>> There is a low hanging fruit on the skip scan, if we need N rows, and >>> one group already has M rows we could stop there. >>> If Nx is the number of friends, and M is the number of posts to show. >>> This runs with complexity (Nx * M) rows, followed by an (Nx * M) sort, >>> instead of (Nx * N) followed by an (Nx * N) sort. >>> Where M =3D 10 and N is 1000 this is a significant improvement. >>> But if M ~ N, the merge scan that runs with M + Nx row accesses, (M + >>> Nx) heap operations. >>> If everything is on the same page the skip scan would win. >>> >>> The cost estimation is probably far off. >>> I am also not considering the filters applied after this operator, and = I >>> don't know if the planner infrastructure is able to adjust it by itself= . >>> This is where I would like reviewer's feedback. I think that the planne= r >>> costs are something to be determined experimentally. >>> >>> Next I will make it slightly more general handling >>> * More index columns: Index (a, b, s...) could support WHERE a IN (...) >>> ORDER BY b LIMIT N (ignoring s...) >>> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c >>> * Non-leading prefix: WHERE b IN (...) AND a =3D const ORDER BY c on in= dex >>> (a, b, c) >>> >>> --- >>> Kind Regards, >>> Alexandre >>> >>> On Wed, Feb 4, 2026 at 7:13=E2=80=AFAM Micha=C5=82 K=C5=82eczek >>> wrote: >>> >>>> >>>> >>>> On 3 Feb 2026, at 22:42, Ants Aasma wrote: >>>> >>>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra wrote: >>>> >>>> I'm also wondering how common is the targeted query pattern? How commo= n >>>> it is to have an IN condition on the leading column in an index, and >>>> ORDER BY on the second one? >>>> >>>> >>>> I have seen this pattern multiple times. My nickname for it is the >>>> timeline view. Think of the social media timeline, showing posts from >>>> all followed accounts in timestamp order, returned in reasonably sized >>>> batches. The naive SQL query will have to scan all posts from all >>>> followed accounts and pass them through a top-N sort. When the total >>>> number of posts is much larger than the batch size this is much slower >>>> than what is proposed here (assuming I understand it correctly) - >>>> effectively equivalent to running N index scans through Merge Append. >>>> >>>> >>>> My workarounds I have proposed users have been either to rewrite the >>>> query as a UNION ALL of a set of single value prefix queries wrapped >>>> in an order by limit. This gives the exact needed merge append plan >>>> shape. But repeating the query N times can get unwieldy when the >>>> number of values grows, so the fallback is: >>>> >>>> SELECT * FROM unnest(:friends) id, LATERAL ( >>>> SELECT * FROM posts >>>> WHERE user_id =3D id >>>> ORDER BY tstamp DESC LIMIT 100) >>>> ORDER BY tstamp DESC LIMIT 100; >>>> >>>> The downside of this formulation is that we still have to fetch a >>>> batch worth of items from scans where we otherwise would have only had >>>> to look at one index tuple. >>>> >>>> >>>> GIST can be used to handle this kind of queries as it supports multipl= e >>>> sort orders. >>>> The only problem is that GIST does not support ORDER BY column. >>>> One possible workaround is [1] but as described there it does not play >>>> well with partitioning. >>>> I=E2=80=99ve started drafting support for ORDER BY column in GIST - se= e [2]. >>>> I think it would be easier to implement and maintain than a new IAM >>>> (but I don=E2=80=99t have enough knowledge and experience to implement= it myself) >>>> >>>> [1] >>>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1E= C21F%40kleczek.org >>>> [2] >>>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5= DC91%40kleczek.org >>>> >>>> =E2=80=94 >>>> Kind regards, >>>> Michal >>>> >>> --000000000000f826dc064d379ae2 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Happy St. Patrick's day!
<= div>

Based on what I said said in previous emails = I see alternative
proposals

#1 Make it s= impler by not changing the index access methods.

#= 2=C2=A0Make it optimal in some sense by not using generic index searches
and not keeping multiple open index scans.

and=C2=A0
#3 Follow the pragmatic approach
=
Objective is, minimize the number of heap fetches.
As = high level as possible, reusing existing functions
instead of wri= ting custom code when possible.


Ants Aasma & Tomas Vondra
>= My workarounds I have proposed users have been either to rewrite the
> query as a UNION ALL of a set of single value prefix queries wrapped > in an order by limit. This gives the exact needed merge append plan > shape. But repeating the query N times can get unwieldy when the
> number of values grows, so the fallback is:
>
> SELECT * FROM unnest(:friends) id, LATERAL (
>=C2=A0 =C2=A0 =C2=A0SELECT * FROM posts
>=C2=A0 =C2=A0 =C2=A0WHERE user_id =3D id
>=C2=A0 =C2=A0 =C2=A0ORDER BY tstamp DESC LIMIT 100)
> ORDER BY tstamp DESC LIMIT 100;
>
> The downside of this formulation is that we still have to fetch a
> batch worth of items from scans where we otherwise would have only had=
> to look at one index tuple.
>=C2=A0
True. It's useful to think about the query this way, and it may be
better than full select + sort, but it has issues too.

An issue with this query is generality, if this is joined w= ith other
queries we can't determine in advance the limit.
=C2=A0

> The main problem I can see is that at planning time the cardinality of=
> the prefix array might not be known, and in theory could be in the
> millions. Having millions of index scans open at the same time is not<= br> > viable, so the method needs to somehow degrade gracefully. The idea I<= br> > had is to pick some limit, based on work_mem and/or benchmarking, and<= br> > one the limit is hit, populate the first batch and then run the next > batch of index scans, merging with the first result. Or something like=
> that, I can imagine a few different ways to handle it with different > tradeoffs.
>

Doesn't the proposed merge scan have a similar issue? Because that will=
also have to keep all the index scans open (even if only internally).
Indeed, it needs to degrade gracefully, in some way.
=C2= =A0
It is true, but I think we can trust the planner.
T= his problem scales=C2=A0similarly in a=C2=A0memoize node.
Is ~24k= B for each open index scan a good guess?

ALTERNATI= VE #1 - More efficient

Or to avoid having N open i= ndex scans we could=C2=A0 (??)
(1) find the index page for the he= ad of each prefix.
(2) for each prefix
(2.a) load tuples from = each head page
(2.b) if we consume the last tuple in a page save = a pointer
to the next page.
(2.c) check if tuples for t= he next prefix are in the same page
(2.d) Release the page.
=
(3) producing tuples in the suffix order
=C2=A0 (3.b) when t= uples for prefix are exhausted load load
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 page from (2.b)

Step 2.a. would possibl= y waste time extracting tuples that are=C2=A0
not needed, and mem= ory by storing them. Not sure how efficient
this can be compared = to having an open index scan.

Matthias van de Meen= t, Feb 3
> btree=C2=A0index=C2= =A0skip scan infrastructure efficiently prevents new=C2=A0index
> descents into the=C2=A0index=C2=A0when the selected SAOP key ranges are directly> adjecent, while merge scan would generally do at least one=C2=A0index
> descent for each of its N scan he= ads (*) - which in the proposed
> prototype patch guarantees O(index=C2=A0depth * num scan heads) buffer
>= accesses.

This could also be addressed if we do t= his custom descent,
I didn't bother about that depth factor b= ecause with a few random prefixes
doing so we are probably going = to save accesses only for the top level.

I wo= uld prefer to start with a very conceptual implementation
that ca= n already provide 1000x speedup, but if you think this
way is bet= ter, I am open to try it. I think this can be done
without affect= ing the planner logic and the PrefixJoin node.


I'm afra= id the
proposed batches execution will be rather complex, so I'd say v1 should=
simply have a threshold, and do the full scan + sort for more items.

Do you mean by an executor node that performs the query as i= f it was written

ALTERNATIVE #2 - Simpler(??)
for each _prefix of prefixes:
=C2=A0 result=C2=A0+=3D (SELECT FROM table=C2= =A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 W= HERE prefix =3D _prefix AND qual(*)=C2=A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ORDER BY suffix=C2=A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 LIMIT N)
=
return SELECT * FROM result=C2=A0
=C2=A0 =C2=A0 ORDER BY suffix=C2=A0
=C2=A0 =C2=A0 LIMIT N
This query may have to produce N * len(prefixes) rows, while t= he=C2=A0
original proposal would produce only N=C2=A0+ len(prefix= es) - 1.

Comparing to the previous results,
<= div>> | Method =C2=A0 =C2=A0 | Sha= red Hit | Shared Read | Exec Time |
> |------------|-----------:|--= ----------:|----------:|
= > | Merge =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 13 | =C2=A0 = =C2=A0 =C2=A0 =C2=A0 119 | =C2=A0 =C2=A0 13 ms |
> | IndexScan =C2= =A0| =C2=A0 =C2=A0 15,308 | =C2=A0 =C2=A0 525,310 | =C2=A03,409 ms |=

This Prefix Batch Scan approach
hit=3D62 read=3D773,=C2=A0Execution Time: 80.815 ms
With 8 prefixes, the execution time= increased to 80.8/13 ~ 6.21
And the number of buffers by 835/132= ~ 6.32

> I can imagine that this would really nicely benefit from ReadStream= 9;ification.
>

Not sure, maybe.

Actually as I was watc= hing the index prefetch development I was
quite uncertain about h= ow this would play with that, but we can
probably simply give a b= udget for each stream.


> One other connection I see is with block nested loops. In a perfect > future PostgreSQL could run the following as a set of merged index
> scans that terminate early:
>
> SELECT posts.*
> FROM follows f
>=C2=A0 =C2=A0 =C2=A0JOIN posts p ON f.followed_id =3D p.user_id
> WHERE f.follower_id =3D :userid
> ORDER BY p.tstamp DESC LIMIT 100;
>
> In practice this is not a huge issue - it's not that hard to trans= form
> this to array_agg and =3D ANY subqueries.
>=C2=A0
Automating that transformation seems quite non-trivial (to me).

Well, not trivial. To give a rough idea.

wc -l *.patch
=C2=A0 =C2=A0 =C2=A0113 v2-0001-Test-the-baseline.patch
=C2= =A0 =C2=A0 =C2=A0614 v2-0002-Access-method.patch
=C2=A0 =C2=A0 =C2=A0850= v2-0003-Planner-integration.patch
=C2=A0 =C2=A0 1958 v2-0004-Multi-colu= mn.patch
=C2=A0 =C2=A0 2439 v2-0005-Joins.patch

it is missing some important details like prefix deduplication
but for the scenario where the values on the other table
ar= e known to be unique it is good.

The multi column = accepts things like A in (...) B in (...)
and computes the cartes= ian product or (A, B) IN (...)


Regards,
Alexandre

On Mon, Feb 23, 2026 at 10:08=E2=80=AFPM= Alexandre Felipe <o.ale= xandre.felipe@gmail.com> wrote:
Hi Hackers,

Do you think that MERGE-SCAN was a terrible = name? I wanted a name that wouldn't=C2=A0

require much explanation. I named it like th= is because it relies on a k-way=C2=A0

merge to combine several segments of an inde= x in one result.=C2=A0But we already=C2=A0

have a MERGE statement. Even in the example = plan above we can see an=C2=A0external=C2=A0

merge that has nothing to do with the new fe= ature, and now as I am doing joins,=C2=A0

I started doing it on the NestedLoop trying = to follow the same conditions that=C2=A0

lead to a memoize. But I added so many field= s to the NestedLoop state that I=C2=A0

think it is good to have a separate=C2=A0str= ucture, and maybe a separate node, and=C2=A0

MergeScan of course is taken hehe. I was thi= nking of IndexPrefixMerge. We could=C2=A0

use the Ants nickname TimeLineScan, but of c= ourse it is not limited to time=C2=A0

lines (even though realistically, that will = probably be the most common use of=C2=A0

this). Another one I considered was Transpos= edIndexScan, because it orders=C2=A0

output on (suffix, prefix) instead of (prefi= x, suffix).





On Fri, Feb 6, 2026= at 10:52=E2=80=AFAM Alexandre Felipe <o.alexandre.felipe@gmail.com> wrote= :
Hello again hackers!

+pt@bowt.ie: That seems to be the one that is probab= ly the most familiar with the index scan (based on the commits).
= +michael@paquier.xyz=C2=A0,=C2=A0+tgl@sss.pgh.pa.us=C2=A0,=C2=A0+= peter@eisentraut.org=C2=A0as the top 3 committers to nbtree over the la= st ~6 months.

I=C2=A0 have made substantial=C2= =A0progress on adding a few features. I have questions, but I will let you = go first :)

Motivation:
In technical terms: this proposal is to take advantage of a = btree index when the query is filtered by a few distinct prefixes and order= ed by a suffix and has a limit.
In non technical: This= could help to efficiently render=C2=A0a social network feed, where each us= er can select a list of users whose posts they want to see, and the posts m= ust be ordered from newest to oldest.

Performance Comparison

I did a test with a toy ta= ble, please find more details below.

With limit 100

| Method =C2=A0 =C2=A0 | Shared Hit | Shared Read | Exec Ti= me |
|------------|-----------:|------------:|----------:|
| Merge = =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 13 | =C2=A0 =C2=A0 =C2=A0= =C2=A0 119 | =C2=A0 =C2=A0 13 ms |
| IndexScan =C2=A0| =C2=A0 =C2=A0 15= ,308 | =C2=A0 =C2=A0 525,310 | =C2=A03,409 ms |


With limit 1,= 000,000

| Method =C2=A0 =C2=A0 | =C2=A0Shar= edHit | SharRead | Temp I | Temp O | Exec Time |
|------------|---------= --:|---------:|-------:|-------:|----------:|
| Merge =C2=A0 =C2=A0 =C2= =A0| =C2=A0 =C2=A0980,318 | =C2=A0 19,721 | =C2=A0 =C2=A0 =C2=A00 | =C2=A0 = =C2=A0 =C2=A00 | =C2=A02,128 ms |
| Sequential | =C2=A0 =C2=A0 15,208 | = =C2=A0525,410 | 20,207 | 35,384 | =C2=A03,762 ms |
| Bitmap =C2=A0 =C2= =A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0629 | =C2=A0113,759 | 20,207 | 35,385 | = =C2=A05,487 ms |
| IndexScan =C2=A0| =C2=A07,880,619 | =C2=A0126,706 | 2= 0,945 | 35,386 | =C2=A05,874 ms |


Sequential scans and bitmap= scans in this case reduces significantly the number of
accessed buff b= ecause the table has only four integer columns, and these methods
can r= ead all the lines on a given page at a time.

However that comes at t= he cost of resorting to an in-disk sort method.
For the query with limi= t 100 we get no temp files as we are using a
top-100 sort.
make check passes


Experim= ent details

Consider a 100M row table formed (a,b,c,d) \i= n 100 x 100 x 100 x 100


```sql
CREA= TE TABLE grid AS (
=C2=A0 =C2=A0 SELECT a, b, c, d, FROM
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 generate_series(1, 100) AS a,
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 generate_series(1, 100) AS b,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 genera= te_series(1, 100) AS c,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 generate_series(1, = 100) AS d
);

CREATE INDEX grid_index ON grid (a, b, c);
ANALYS= E grid;

```

Now let's say that we need to find certain= number of rows filtered by a and ordered by b;
```sql
PREPARE grid_query(int) AS
SELECT sum(d) FROM (
=C2=A0 =C2= =A0 SELECT * FROM grid
=C2=A0 =C2=A0 WHERE a IN (2,3,5,8,13,21,34,55) A= ND b >=3D 0
=C2=A0 =C2=A0 ORDER BY b
=C2=A0 =C2=A0 LIMIT $1) t;<= /font>
```

---


Now with limit 100, with index merge sc= an (notice Index Prefixes in the plan).

```sql
SET enable_indexme= rgescan =3D on;
EXPLAIN (ANALYSE) EXECUTE grid_query(100);
```
```text
=C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D119
=C2=A0 =C2= =A0-> =C2=A0Limit =C2=A0(cost=3D0.57..87.29 rows=3D100 width=3D16) (actu= al time=3D5.528..12.999 rows=3D100.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D119
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0-> =C2=A0Index Scan using grid_a_b_c_idx on grid =C2=A0(cos= t=3D0.57..93.36 rows=3D107 width=3D16) (actual time=3D5.528..12.994 rows=3D= 100.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0Index Cond: (b >=3D 0)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0Index Prefixes: (a =3D ANY ('{2,3,5,8,13,21,34,55}&= #39;::integer[]))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0Index Searches: 8
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0Buffers: shared hit=3D13 read=3D119
=C2=A0Planning:
=C2=A0 =C2= =A0Buffers: shared hit=3D59 read=3D23
=C2=A0Planning Time: 4.619 ms
= =C2=A0Execution Time: 13.055 ms
=C2=A0```


```sql
SET enabl= e_indexmergescan =3D off;
EXPLAIN (ANALYSE) EXECUTE grid_query(100);
= ```

```text
=C2=A0Aggregate =C2=A0(cost=3D1603588.06..1603588.07 = rows=3D1 width=3D8) (actual time=3D3406.624..3408.710 rows=3D1.00 loops=3D1= )
=C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D525310
=C2=A0 =C2= =A0-> =C2=A0Limit =C2=A0(cost=3D1603575.17..1603586.81 rows=3D100 width= =3D16) (actual time=3D3406.601..3408.702 rows=3D100.00 loops=3D1)
=C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D525310
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Gather Merge =C2=A0(cost=3D16= 03575.17..2514342.92 rows=3D7819999 width=3D16) (actual time=3D3406.598..34= 08.695 rows=3D100.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0Workers Planned: 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0Workers Launched: 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D525310
=C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Sort =C2=A0(co= st=3D1602575.14..1610720.98 rows=3D3258333 width=3D16) (actual time=3D3393.= 782..3393.784 rows=3D100.00 loops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Meth= od: top-N heapsort =C2=A0Memory: 32kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 read= =3D525310
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0Worker 0: =C2=A0Sort Method: top-N heapsort =C2=A0Memory: 32k= B
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0Worker 1: =C2=A0Sort Method: top-N heapsort =C2=A0Memory: 32kB
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&= gt; =C2=A0Parallel Seq Scan on grid =C2=A0(cost=3D0.00..1478044.00 r= ows=3D3258333 width=3D16) (actual time=3D0.944..3129.896 rows=3D2666666.67 = loops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Filter: ((b >=3D 0) AND (a =3D ANY= ('{2,3,5,8,13,21,34,55}'::integer[])))
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Ro= ws Removed by Filter: 30666667
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit= =3D15234 read=3D525310
=C2=A0Planning Time: 0.370 ms
=C2=A0Execution = Time: 3409.134 ms
=C2=A0```

Now queries with limit 1,000,000
<= br>```sql
SET enable_indexmergescan =3D on;
EXPLAIN ANALYSE EXECUTE g= rid_query(1000000);
```

Query executed with the proposed access m= ethod. Notice in the plan Index Prefixes and Index Cond.
```text
=C2= =A0 =C2=A0Buffers: shared hit=3D980318 read=3D19721
=C2=A0 =C2=A0-> = =C2=A0Limit =C2=A0(cost=3D0.57..867259.84 rows=3D1000000 width=3D16) (actua= l time=3D2.854..2103.438 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0Buffers: shared hit=3D980318 read=3D19721
=C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0-> =C2=A0Index Scan using grid_a_b_c_idx on grid = =C2=A0(cost=3D0.57..867265.91 rows=3D1000007 width=3D16) (actual time=3D2.8= 52..2066.205 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: (b >=3D 0)
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Prefixes: (a =3D ANY (&#= 39;{2,3,5,8,13,21,34,55}'::integer[]))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 8
=C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D980318 read=3D19721
= =C2=A0Planning Time: 0.328 ms
=C2=A0Execution Time: 2127.811 ms
=C2= =A0```

If we disable index_mergescan we naturally we fall into a seq= uential scan.

```sql
SET enable_indexmergescan =3D off;
EXPLAI= N ANALYSE EXECUTE grid_query(1000000);
```
```text
=C2=A0 =C2=A0Bu= ffers: shared hit=3D15208 read=3D525410, temp read=3D20207 written=3D35384<= br>=C2=A0 =C2=A0-> =C2=A0Limit =C2=A0(cost=3D1942895.64..2059362.12 rows= =3D1000000 width=3D16) (actual time=3D3467.012..3712.044 rows=3D1000000.00 = loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15208= read=3D525410, temp read=3D20207 written=3D35384
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0-> =C2=A0Gather Merge =C2=A0(cost=3D1942895.64..2853663.39 = rows=3D7819999 width=3D16) (actual time=3D3467.010..3671.220 rows=3D1000000= .00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Wo= rkers Planned: 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= Workers Launched: 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 written=3D3= 5384
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0= Sort =C2=A0(cost=3D1941895.62..1950041.45 rows=3D3258333 width=3D16) (actua= l time=3D3455.852..3476.358 rows=3D334576.33 loops=3D3)
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: gri= d.b
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0Sort Method: external merge =C2=A0Disk: 47016kB
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffer= s: shared hit=3D15208 read=3D525410, temp read=3D20207 written=3D35384
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0Worker 0: =C2=A0Sort Method: external merge =C2=A0Disk: 46976kB
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Wo= rker 1: =C2=A0Sort Method: external merge =C2=A0Disk: 47000kB
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2= =A0Parallel Seq Scan on grid =C2=A0(cost=3D0.00..1478044.00 rows=3D3= 258333 width=3D16) (actual time=3D2.789..2779.483 rows=3D2666666.67 loops= =3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Filter: ((b >=3D 0) AND (a =3D ANY ('= {2,3,5,8,13,21,34,55}'::integer[])))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Rows Remo= ved by Filter: 30666667
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D1513= 4 read=3D525410
=C2=A0Planning Time: 0.332 ms
=C2=A0Execution Time: 3= 761.866 ms
```
=C2=A0
If we disable sequential scans, then we get = a bitmap scan

```sql
SET enable_seqscan =3D off;
EXPLAIN ANALY= SE EXECUTE grid_query(1000000);
```
```text
=C2=A0 =C2=A0Buffers: = shared hit=3D629 read=3D113759 written=3D2, temp read=3D20207 written=3D353= 85
=C2=A0 =C2=A0-> =C2=A0Limit =C2=A0(cost=3D1998199.78..2114666.26 r= ows=3D1000000 width=3D16) (actual time=3D5170.456..5453.433 rows=3D1000000.= 00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D62= 9 read=3D113759 written=3D2, temp read=3D20207 written=3D35385
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Gather Merge =C2=A0(cost=3D1998199.7= 8..2908967.53 rows=3D7819999 width=3D16) (actual time=3D5170.455..5413.254 = rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0Workers Planned: 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0Workers Launched: 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D629 read=3D113759 written=3D2, te= mp read=3D20207 written=3D35385
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0-> =C2=A0Sort =C2=A0(cost=3D1997199.75..2005345.59 rows= =3D3258333 width=3D16) (actual time=3D5156.929..5177.507 rows=3D334500.67 l= oops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0Sort Key: grid.b
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: external merge =C2=A0Disk: = 47032kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read= =3D20207 written=3D35385
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Method: external merge = =C2=A0Disk: 47280kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: external merge =C2=A0D= isk: 46680kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0-> =C2=A0Parallel Bitmap Heap Scan on grid =C2=A0(cost= =3D107691.54..1533348.13 rows=3D3258333 width=3D16) (actual time=3D299.891.= .4489.787 rows=3D2666666.67 loops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Recheck C= ond: ((a =3D ANY ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= =3D 0))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Rows Removed by Index Recheck: 241= 0242
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Heap Blocks: exact=3D13100 lossy=3D22639
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D615 read=3D113759 written=3D2=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Heap Blocks: exact=3D13077 lossy= =3D22755
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Heap Blocks: exact=3D1303= 6 lossy=3D22421
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Bitmap Index Scan on grid_a_b_c_idx =C2=A0(cost=3D0.00..105736.54 rows=3D7820000 width=3D0= ) (actual time=3D297.651..297.651 rows=3D8000000.00 loops=3D1)
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: ((a =3D ANY ('{2,3,5,= 8,13,21,34,55}'::integer[])) AND (b >=3D 0))
=C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 7
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D7293 written=3D2
=C2=A0= Planning Time: 0.165 ms
=C2=A0Execution Time: 5487.213 ms
```

= If we disable bitmap scans we finally get an index scan

```sql
SE= T enable_bitmapscan =3D off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000)= ;
```
```
=C2=A0 =C2=A0Buffers: shared hit=3D7883221 read=3D124111= , temp read=3D20699 written=3D35385
=C2=A0 =C2=A0-> =C2=A0Limit =C2= =A0(cost=3D7201203.08..7317669.55 rows=3D1000000 width=3D16) (actual time= =3D4414.478..4674.400 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699= written=3D35385
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Gather Me= rge =C2=A0(cost=3D7201203.08..8111970.83 rows=3D7819999 width=3D16) (actual= time=3D4414.476..4633.982 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 2
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Launched: 2
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D78832= 21 read=3D124111, temp read=3D20699 written=3D35385
=C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Sort =C2=A0(cost=3D7200203.0= 5..7208348.88 rows=3D3258333 width=3D16) (actual time=3D4390.625..4411.896 = rows=3D334567.00 loops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: exte= rnal merge =C2=A0Disk: 47304kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D7883221 read= =3D124111, temp read=3D20699 written=3D35385
=C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Metho= d: external merge =C2=A0Disk: 47304kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: exte= rnal merge =C2=A0Disk: 46384kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Parallel Index Scan u= sing grid_a_b_c_idx on grid =C2=A0(cost=3D0.57..6736351.43 rows=3D3258333 w= idth=3D16) (actual time=3D46.925..3796.915 rows=3D2666666.67 loops=3D3)
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Index Cond: ((a =3D ANY ('{2,3,5,8,13,21,34,55}= '::integer[])) AND (b >=3D 0))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Search= es: 7
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D7883208 read=3D124110<= br>=C2=A0Planning Time: 0.385 ms
=C2=A0Execution Time: 4713.325 ms
= =C2=A0```

=C2=A0




On Thu, Feb 5, 2026 at 6:59=E2=80=AFAM Alexandre Felipe <o.alexandre= .felipe@gmail.com> wrote:
Thank you for looking into this.

=
Now we can execute a, still narrow, family queries!
Maybe it helps to see this as a=C2=A0social network=C2=A0fe= eds. Imagine a social network, you have a few friends, or follow a few = people, and you want to see their updates ordered by date. For each user we= have a different combination of users that we have to display. But maybe, = even having hundreds of users we will only show the first 10.
There is a low hanging fruit on the skip scan, if we need N row= s, and one group already has M rows we could stop there.
If Nx is= the number of friends, and M is the number of posts to show.
Thi= s runs with complexity (Nx * M) rows, followed by an (Nx * M) sort, instead= of (Nx * N) followed by an (Nx * N) sort.
Where M =3D 10 and N i= s 1000 this is a significant improvement.
But if M ~ N, the merge= scan=C2=A0that runs with M=C2=A0+ Nx row accesses, (M=C2=A0+ Nx) heap oper= ations.
If everything is on the same page the skip scan would win= .

The cost estimation is probably far off.
I am also not considering the filters=C2=A0applied after this=C2=A0opera= tor,=C2=A0and I don't know if the planner infrastructure is able to adj= ust it by itself.
This is where I would like reviewer's feedb= ack. I think that the planner costs are something to be determined experime= ntally.

Next I will make it slightly more general = handling
* More index columns: Index (a, b, s...) could support WHERE a = IN (...) ORDER BY b LIMIT N (ignoring s...)
* Multi-column prefix: WHERE= (a, b) IN (...) ORDER BY c
* Non-leading prefix: WHERE b IN (...) AND a= =3D const ORDER BY c on index (a, b, c)

---
=
Kind Regards,
Alexandre

On Wed, Feb 4, 2026 at 7:13=E2= =80=AFAM Micha=C5=82 K=C5=82eczek <michal@kleczek.org> wrote:


On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wrote:

On Mon, 2 Feb 2026 at 01:54, Tomas Vondra &= lt;tomas@vondra.me= > wrote:
I'm also wondering how common = is the targeted query pattern? How common
it is to have an IN condition = on the leading column in an index, and
ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it = is the
timeline view. Think of the social media timeline, showing posts = from
all followed accounts in timestamp order, returned in reasonably si= zed
batches. The naive SQL query will have to scan all posts from allfollowed accounts and pass them through a top-N sort. When the total
nu= mber of posts is much larger than the batch size this is much slower
tha= n what is proposed here (assuming I understand it correctly) -
effective= ly equivalent to running N index scans through Merge Append.

My workarounds I have pro= posed users have been either to rewrite the
query as a UNION ALL of a se= t of single value prefix queries wrapped
in an order by limit. This give= s the exact needed merge append plan
shape. But repeating the query N ti= mes can get unwieldy when the
number of values grows, so the fallback is= :

SELECT * FROM unnest(:friends) id, LATERAL (
=C2=A0=C2=A0=C2= =A0SELECT * FROM posts
=C2=A0=C2=A0=C2=A0WHERE user_id =3D id
=C2= =A0=C2=A0=C2=A0ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMI= T 100;

The downside of this formulation is that we still have to fet= ch a
batch worth of items from scans where we otherwise would have only = had
to look at one index tuple.

GIST can be used to handle this kind of queries as it supports mult= iple sort orders.
The only problem is that GIST does not support = ORDER BY column.
One possible workaround is [1] but as described = there it does not play well with partitioning.
I=E2=80=99ve start= ed drafting support for ORDER BY column in GIST - see [2].
I thin= k it would be easier to implement and maintain than a new IAM (but I don=E2= =80=99t have enough knowledge and experience to implement it myself)
<= div>

=E2=80=94
Kind regards,
Mic= hal
--000000000000f826dc064d379ae2--