Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vue6i-00HHKy-1A for pgsql-hackers@arkaria.postgresql.org; Mon, 23 Feb 2026 22:08:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vue6h-00FhDw-14 for pgsql-hackers@arkaria.postgresql.org; Mon, 23 Feb 2026 22:08:47 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vue6g-00FhDj-1g for pgsql-hackers@lists.postgresql.org; Mon, 23 Feb 2026 22:08:47 +0000 Received: from mail-ed1-x533.google.com ([2a00:1450:4864:20::533]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vue6b-00000000rWt-2lOK for pgsql-hackers@postgresql.org; Mon, 23 Feb 2026 22:08:45 +0000 Received: by mail-ed1-x533.google.com with SMTP id 4fb4d7f45d1cf-65c01595082so7074693a12.3 for ; Mon, 23 Feb 2026 14:08:42 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1771884521; cv=none; d=google.com; s=arc-20240605; b=eSL/0qRvr/4UXOUuCgPF0EbiM1zYeqF9ilTcF2M4+bbbeYHg3GBJ9xmesJJE8EUCTH fkBijyaaaV9tzunLFuQjpvIk3CF25ZjXjvUI4ESMhFJwFSE8odm+Na/jZZF0w71YkNnu I5fQFh+YIaWBZLTMt7vTXgLaDLj0XcWdsgebXafU5Uxi/FGBQWUdCYoFFR7fyM6gR2+M NMGH05bH3dkAkH+caOpOikS4fA/06u24E6TeqZRd5QrmXcNUfXvHUiCxTbZJxgv3SZPh TzbsS1NcXoKBZJz6wJ3DOLiRh5i0N0dK/YbZfFvvCqo0bArBlJSqDnpsdSOMa4NqZkrC 0/yQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=9AiPztkwkE92LVy6+cCGjsaayicddGXLyN7ynMsAWIU=; fh=0TjJ3tmIly6MqaDvTlGk0yBHDu6xXpMxxqg7tOC6tKM=; b=ilGcodQWzAOaUkxoa80jOLceRcwqGNqGU23/082elCFCG0pk1ebAEQLt/DpOvMTe+9 nP0QLp0kQzFx6TbtdDuRz26mvEn6XHwKDdiLYL53wvgq6WHQ3L6i4cLpupX8z3V9XcM5 3Y8EJqTIQwv1Urz2W+LFaCDAtNmlN2NM4ontLQOYbouvHiunm2vFbpFq5Nv+6hhsV3rn oesaL2VnKpj+ZYcYujUmGJVPvB8iOi7xlXxEHiLkIQAje7JW15a58i8JyKDEzG6ZWnak 0tLF/2XFwl+NKa5LRAA5IUprDJRBIXEKfOCOBHGh8HY3S//XwrBlhu+Jcsn270CBNzZJ 6mcw==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1771884521; x=1772489321; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=9AiPztkwkE92LVy6+cCGjsaayicddGXLyN7ynMsAWIU=; b=BqITwaqd9gSXR5sGMj/FCzGU0bkYYHpDfd4e20UA9Wh9afx/RQqKtSXgSdlfXAvUOD Y3TfP5EaSplTx8d1V2raOtxP3dzmMW0wu0rR+BL5j4UH1nXKaNdP0y8lRS3qV/1ZIAPi B4S84ZVhY4FT6MtJunA46Wbi1sjwOmhvmvgzVzWvspoI2JngnGDZTiqGtyv5x0RtrzJs raj1HvhrEH01y+Udko1R9dV9C8kP8kzCxXFkGH/lPMHI0ooVQBi0fX+67CYT5Q68ICUi 5PliFoOb+PFApfSOU6moInc+zgHfvI4GILjDMl4N+XhYD0/pKAbnb0Dfwp7QC7Kf09u3 BDMw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1771884521; x=1772489321; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=9AiPztkwkE92LVy6+cCGjsaayicddGXLyN7ynMsAWIU=; b=m6Spj+TQZRI6GQNmXPiSHOb52yKujn2Iag925342Q8MVMTlV79jzddTw7fx3FZInx/ Y7X9pSUNxxqF7L13xzMYiepWIlJlfAIYmiaX21bdqrwbHPHX463X69usAUEGw9ocZhsw /pasvtr6DLiDtBz9HPxSVRMEV3i3CHg/JrI/z9N7PK0KiD/Ch43e1lQwm6CLy5JHXU0Z sCK24Ra+/s0CesXrV5oMkfJSZJUg2bvpAnWYABgzpwu8kGI0wkocjRAnYfCaaApa2iXT S5DjYmD/DQkSsniVg27r75lSliLUKs1F+rq/UZVHJ5oOQZOrwSxI75YjepRFGpvjibwd +JXg== X-Gm-Message-State: AOJu0YwYdlAGKjmNcAUBc0tP+S2RmPH/xQJKadc/e2eHX+ipRAPMGx3s dXRMoDMZm5z6Aq2kzZMxAbsM6PipvK4+SaomuoR2oaJ/ttOJxSfcfHhEgcs/Ac+tTUtPjDeE380 rIm9bLgxv9U0EbftkVXi7yhX7JCbjUZayKKP+gkg= X-Gm-Gg: ATEYQzyH/v7rojsiqQBSV7BrZoq5hTwTvl4h8xriPILQ2R28anUeHNlLlQaCov3PIO5 G/VqyVuVmER5LbwfI6LWsxwhwrzTuoHaE9bw7axR2K+3nudVar9hLHbn1pJyBIyUmEPqdHrPdM5 KraD53j2Pr/LyM5i7MW0n709kAeGNtWp3AS1NkdQQKO6CtB2RdgTTFOJ1OK0JGB6PTsz63Q9QQZ mJ98slNA5dfJxaEiT2tcPNCcyMUX3nsc8wMlQKVzod0RoEkR/ybKapxHb/M7030ZpZnxK0HXju7 oFesGaFIbkIakhhxR32tGcHByPBnW+Ze5MYkbZu3XN0gXic9Nw== X-Received: by 2002:a05:6402:13c3:b0:65a:4209:f83b with SMTP id 4fb4d7f45d1cf-65ea57d3ce8mr5841477a12.29.1771884520648; Mon, 23 Feb 2026 14:08:40 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Alexandre Felipe Date: Mon, 23 Feb 2026 22:08:29 +0000 X-Gm-Features: AaiRm522klm2bwtdy5VFW00EhvMMJb6yn9yURKEVOrzGD3CJuOCqgFNm6VQu2N8 Message-ID: Subject: Re: New access method for b-tree. To: "pgsql-hackers@postgresql.org" , michael@paquier.xyz, tgl@sss.pgh.pa.us, "peter@eisentraut.org" Cc: Ants Aasma , Tomas Vondra , Alexandre Felipe , =?UTF-8?B?TWljaGHFgiBLxYJlY3plaw==?= , pg@bowt.ie Content-Type: multipart/alternative; boundary="0000000000006fb55f064b8503b1" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000006fb55f064b8503b1 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Hackers, Do you think that MERGE-SCAN was a terrible name? I wanted a name that wouldn't require much explanation. I named it like this because it relies on a k-way merge to combine several segments of an index in one result. But we already have a MERGE statement. Even in the example plan above we can see an external merge that has nothing to do with the new feature, and now as I am doing joins, I started doing it on the NestedLoop trying to follow the same conditions that lead to a memoize. But I added so many fields to the NestedLoop state that = I think it is good to have a separate structure, and maybe a separate node, and MergeScan of course is taken hehe. I was thinking of IndexPrefixMerge. We could use the Ants nickname TimeLineScan, but of course it is not limited to time lines (even though realistically, that will probably be the most common use of this). Another one I considered was TransposedIndexScan, because it orders output on (suffix, prefix) instead of (prefix, suffix). On Fri, Feb 6, 2026 at 10:52=E2=80=AFAM Alexandre Felipe < o.alexandre.felipe@gmail.com> wrote: > Hello again hackers! > > +pt@bowt.ie : That seems to be the one that is probably the > most familiar with the index scan (based on the commits). > +michael@paquier.xyz , +tgl@sss.pgh.pa.us > , +peter@eisentraut.org as the > top 3 committers to nbtree over the last ~6 months. > > I have made substantial progress on adding a few features. I have > questions, but I will let you go first :) > > Motivation: > *In technical terms:* this proposal is to take advantage of a btree index > when the query is filtered by a few distinct prefixes and ordered by a > suffix and has a limit. > *In non technical:* This could help to efficiently render a social > network feed, where each user can select a list of users whose posts they > want to see, and the posts must be ordered from newest to oldest. > > > *Performance Comparison* > I did a test with a toy table, please find more details below. > > With limit 100 > > | Method | Shared Hit | Shared Read | Exec Time | > |------------|-----------:|------------:|----------:| > | Merge | 13 | 119 | 13 ms | > | IndexScan | 15,308 | 525,310 | 3,409 ms | > > With limit 1,000,000 > > | Method | SharedHit | SharRead | Temp I | Temp O | Exec Time | > |------------|-----------:|---------:|-------:|-------:|----------:| > | Merge | 980,318 | 19,721 | 0 | 0 | 2,128 ms | > | Sequential | 15,208 | 525,410 | 20,207 | 35,384 | 3,762 ms | > | Bitmap | 629 | 113,759 | 20,207 | 35,385 | 5,487 ms | > | IndexScan | 7,880,619 | 126,706 | 20,945 | 35,386 | 5,874 ms | > > Sequential scans and bitmap scans in this case reduces significantly the > number of > accessed buff because the table has only four integer columns, and these > methods > can read all the lines on a given page at a time. > > However that comes at the cost of resorting to an in-disk sort method. > For the query with limit 100 we get no temp files as we are using a > top-100 sort. > > make check passes > > > *Experiment details* > > Consider a 100M row table formed (a,b,c,d) \in 100 x 100 x 100 x 100 > > > ```sql > CREATE TABLE grid AS ( > SELECT a, b, c, d, FROM > generate_series(1, 100) AS a, > generate_series(1, 100) AS b, > generate_series(1, 100) AS c, > generate_series(1, 100) AS d > ); > > CREATE INDEX grid_index ON grid (a, b, c); > ANALYSE grid; > ``` > > Now let's say that we need to find certain number of rows filtered by a > and ordered by b; > ```sql > PREPARE grid_query(int) AS > SELECT sum(d) FROM ( > SELECT * FROM grid > WHERE a IN (2,3,5,8,13,21,34,55) AND b >=3D 0 > ORDER BY b > LIMIT $1) t; > ``` > > --- > > > Now with limit 100, with index merge scan (notice Index Prefixes in the > plan). > > ```sql > SET enable_indexmergescan =3D on; > EXPLAIN (ANALYSE) EXECUTE grid_query(100); > ``` > > ```text > Buffers: shared hit=3D13 read=3D119 > -> Limit (cost=3D0.57..87.29 rows=3D100 width=3D16) (actual > time=3D5.528..12.999 rows=3D100.00 loops=3D1) > Buffers: shared hit=3D13 read=3D119 > -> Index Scan using grid_a_b_c_idx on grid (cost=3D0.57..93.36 > rows=3D107 width=3D16) (actual time=3D5.528..12.994 rows=3D100.00 loops= =3D1) > Index Cond: (b >=3D 0) > *Index Prefixes: *(a =3D ANY > ('{2,3,5,8,13,21,34,55}'::integer[])) > Index Searches: 8 > Buffers: shared hit=3D13 read=3D119 > Planning: > Buffers: shared hit=3D59 read=3D23 > Planning Time: 4.619 ms > Execution Time: 13.055 ms > ``` > > > ```sql > SET enable_indexmergescan =3D off; > EXPLAIN (ANALYSE) EXECUTE grid_query(100); > ``` > > ```text > Aggregate (cost=3D1603588.06..1603588.07 rows=3D1 width=3D8) (actual > time=3D3406.624..3408.710 rows=3D1.00 loops=3D1) > Buffers: shared hit=3D15308 read=3D525310 > -> Limit (cost=3D1603575.17..1603586.81 rows=3D100 width=3D16) (actu= al > time=3D3406.601..3408.702 rows=3D100.00 loops=3D1) > Buffers: shared hit=3D15308 read=3D525310 > -> Gather Merge (cost=3D1603575.17..2514342.92 rows=3D7819999 > width=3D16) (actual time=3D3406.598..3408.695 rows=3D100.00 loops=3D1) > Workers Planned: 2 > Workers Launched: 2 > Buffers: shared hit=3D15308 read=3D525310 > -> Sort (cost=3D1602575.14..1610720.98 rows=3D3258333 > width=3D16) (actual time=3D3393.782..3393.784 rows=3D100.00 loops=3D3) > Sort Key: grid.b > Sort Method: top-N heapsort Memory: 32kB > Buffers: shared hit=3D15308 read=3D525310 > Worker 0: Sort Method: top-N heapsort Memory: 32kB > Worker 1: Sort Method: top-N heapsort Memory: 32kB > -> *Parallel Seq Scan* on grid > (cost=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D0.944= ..3129.896 > rows=3D2666666.67 loops=3D3) > Filter: ((b >=3D 0) AND (a =3D ANY > ('{2,3,5,8,13,21,34,55}'::integer[]))) > Rows Removed by Filter: 30666667 > Buffers: shared hit=3D15234 read=3D525310 > Planning Time: 0.370 ms > Execution Time: 3409.134 ms > ``` > > Now queries with limit 1,000,000 > > ```sql > SET enable_indexmergescan =3D on; > EXPLAIN ANALYSE EXECUTE grid_query(1000000); > ``` > > Query executed with the proposed access method. Notice in the plan Index > Prefixes and Index Cond. > ```text > Buffers: shared hit=3D980318 read=3D19721 > -> Limit (cost=3D0.57..867259.84 rows=3D1000000 width=3D16) (actual > time=3D2.854..2103.438 rows=3D1000000.00 loops=3D1) > Buffers: shared hit=3D980318 read=3D19721 > -> Index Scan using grid_a_b_c_idx on grid > (cost=3D0.57..867265.91 rows=3D1000007 width=3D16) (actual time=3D2.852.= .2066.205 > rows=3D1000000.00 loops=3D1) > Index Cond: (b >=3D 0) > *Index Prefixes:* (a =3D ANY > ('{2,3,5,8,13,21,34,55}'::integer[])) > Index Searches: 8 > Buffers: shared hit=3D980318 read=3D19721 > Planning Time: 0.328 ms > Execution Time: 2127.811 ms > ``` > > If we disable index_mergescan we naturally we fall into a sequential scan= . > > ```sql > SET enable_indexmergescan =3D off; > EXPLAIN ANALYSE EXECUTE grid_query(1000000); > ``` > ```text > Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 written= =3D35384 > -> Limit (cost=3D1942895.64..2059362.12 rows=3D1000000 width=3D16) (= actual > time=3D3467.012..3712.044 rows=3D1000000.00 loops=3D1) > Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 > written=3D35384 > -> Gather Merge (cost=3D1942895.64..2853663.39 rows=3D7819999 > width=3D16) (actual time=3D3467.010..3671.220 rows=3D1000000.00 loops=3D1= ) > Workers Planned: 2 > Workers Launched: 2 > Buffers: shared hit=3D15208 read=3D525410, temp read=3D202= 07 > written=3D35384 > -> Sort (cost=3D1941895.62..1950041.45 rows=3D3258333 > width=3D16) (actual time=3D3455.852..3476.358 rows=3D334576.33 loops=3D3) > Sort Key: grid.b > Sort Method: *external merge Disk: 47016kB* > Buffers: shared hit=3D15208 read=3D525410, temp > read=3D20207 written=3D35384 > Worker 0: Sort Method: external merge Disk: 46976k= B > Worker 1: Sort Method: external merge Disk: 47000k= B > -> *Parallel Seq Scan* on grid > (cost=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D2.789= ..2779.483 > rows=3D2666666.67 loops=3D3) > Filter: ((b >=3D 0) AND (a =3D ANY > ('{2,3,5,8,13,21,34,55}'::integer[]))) > Rows Removed by Filter: 30666667 > Buffers: shared hit=3D15134 read=3D525410 > Planning Time: 0.332 ms > Execution Time: 3761.866 ms > ``` > > If we disable sequential scans, then we get a bitmap scan > > ```sql > SET enable_seqscan =3D off; > EXPLAIN ANALYSE EXECUTE grid_query(1000000); > ``` > ```text > Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=3D20207 > written=3D35385 > -> Limit (cost=3D1998199.78..2114666.26 rows=3D1000000 width=3D16) (= actual > time=3D5170.456..5453.433 rows=3D1000000.00 loops=3D1) > Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read= =3D20207 > written=3D35385 > -> Gather Merge (cost=3D1998199.78..2908967.53 rows=3D7819999 > width=3D16) (actual time=3D5170.455..5413.254 rows=3D1000000.00 loops=3D1= ) > Workers Planned: 2 > Workers Launched: 2 > Buffers: shared hit=3D629 read=3D113759 written=3D2, temp > read=3D20207 written=3D35385 > -> Sort (cost=3D1997199.75..2005345.59 rows=3D3258333 > width=3D16) (actual time=3D5156.929..5177.507 rows=3D334500.67 loops=3D3) > Sort Key: grid.b > Sort Method: external merge Disk: 47032kB > Buffers: shared hit=3D629 read=3D113759 written=3D2,= temp > read=3D20207 written=3D35385 > Worker 0: Sort Method: external merge Disk: 47280k= B > Worker 1: Sort Method: external merge Disk: 46680k= B > -> Parallel Bitmap Heap Scan on grid > (cost=3D107691.54..1533348.13 rows=3D3258333 width=3D16) (actual > time=3D299.891..4489.787 rows=3D2666666.67 loops=3D3) > Recheck Cond: ((a =3D ANY > ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0)) > Rows *Removed by Index Recheck*: 2410242 > Heap Blocks: exact=3D13100 lossy=3D22639 > Buffers: shared hit=3D615 read=3D113759 writte= n=3D2 > Worker 0: Heap Blocks: exact=3D13077 lossy=3D= 22755 > Worker 1: Heap Blocks: exact=3D13036 lossy=3D= 22421 > -> *Bitmap Index Scan* on grid_a_b_c_idx > (cost=3D0.00..105736.54 rows=3D7820000 width=3D0) (actual time=3D297.651= ..297.651 > rows=3D8000000.00 loops=3D1) > Index Cond: ((a =3D ANY > ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0)) > Index Searches: 7 > Buffers: shared hit=3D13 read=3D7293 wri= tten=3D2 > Planning Time: 0.165 ms > Execution Time: 5487.213 ms > ``` > > If we disable bitmap scans we finally get an index scan > > ```sql > SET enable_bitmapscan =3D off; > EXPLAIN ANALYSE EXECUTE grid_query(1000000); > ``` > ``` > Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699 written= =3D35385 > -> Limit (cost=3D7201203.08..7317669.55 rows=3D1000000 width=3D16) (= actual > time=3D4414.478..4674.400 rows=3D1000000.00 loops=3D1) > Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699 > written=3D35385 > -> Gather Merge (cost=3D7201203.08..8111970.83 rows=3D7819999 > width=3D16) (actual time=3D4414.476..4633.982 rows=3D1000000.00 loops=3D1= ) > Workers Planned: 2 > Workers Launched: 2 > Buffers: shared hit=3D7883221 read=3D124111, temp read=3D2= 0699 > written=3D35385 > -> Sort (cost=3D7200203.05..7208348.88 rows=3D3258333 > width=3D16) (actual time=3D4390.625..4411.896 rows=3D334567.00 loops=3D3) > Sort Key: grid.b > Sort Method: *external merge Disk: 47304kB* > Buffers: shared hit=3D7883221 read=3D124111, temp > read=3D20699 written=3D35385 > Worker 0: Sort Method: external merge Disk: 47304k= B > Worker 1: Sort Method: external merge Disk: 46384k= B > -> *Parallel Index Scan* using grid_a_b_c_idx on > grid (cost=3D0.57..6736351.43 rows=3D3258333 width=3D16) (actual > time=3D46.925..3796.915 rows=3D2666666.67 loops=3D3) > Index Cond: ((a =3D ANY > ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0)) > Index Searches: 7 > Buffers: shared hit=3D7883208 read=3D124110 > Planning Time: 0.385 ms > Execution Time: 4713.325 ms > ``` > > > > > > > On Thu, Feb 5, 2026 at 6:59=E2=80=AFAM Alexandre Felipe < > o.alexandre.felipe@gmail.com> wrote: > >> Thank you for looking into this. >> >> Now we can execute a, still narrow, family queries! >> >> Maybe it helps to see this as a *social network feeds*. Imagine a social >> network, you have a few friends, or follow a few people, and you want to >> see their updates ordered by date. For each user we have a different >> combination of users that we have to display. But maybe, even having >> hundreds of users we will only show the first 10. >> >> There is a low hanging fruit on the skip scan, if we need N rows, and on= e >> group already has M rows we could stop there. >> If Nx is the number of friends, and M is the number of posts to show. >> This runs with complexity (Nx * M) rows, followed by an (Nx * M) sort, >> instead of (Nx * N) followed by an (Nx * N) sort. >> Where M =3D 10 and N is 1000 this is a significant improvement. >> But if M ~ N, the merge scan that runs with M + Nx row accesses, (M + Nx= ) >> heap operations. >> If everything is on the same page the skip scan would win. >> >> The cost estimation is probably far off. >> I am also not considering the filters applied after this operator, and I >> don't know if the planner infrastructure is able to adjust it by itself. >> This is where I would like reviewer's feedback. I think that the planner >> costs are something to be determined experimentally. >> >> Next I will make it slightly more general handling >> * More index columns: Index (a, b, s...) could support WHERE a IN (...) >> ORDER BY b LIMIT N (ignoring s...) >> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c >> * Non-leading prefix: WHERE b IN (...) AND a =3D const ORDER BY c on ind= ex >> (a, b, c) >> >> --- >> Kind Regards, >> Alexandre >> >> On Wed, Feb 4, 2026 at 7:13=E2=80=AFAM Micha=C5=82 K=C5=82eczek wrote: >> >>> >>> >>> On 3 Feb 2026, at 22:42, Ants Aasma wrote: >>> >>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra wrote: >>> >>> I'm also wondering how common is the targeted query pattern? How common >>> it is to have an IN condition on the leading column in an index, and >>> ORDER BY on the second one? >>> >>> >>> I have seen this pattern multiple times. My nickname for it is the >>> timeline view. Think of the social media timeline, showing posts from >>> all followed accounts in timestamp order, returned in reasonably sized >>> batches. The naive SQL query will have to scan all posts from all >>> followed accounts and pass them through a top-N sort. When the total >>> number of posts is much larger than the batch size this is much slower >>> than what is proposed here (assuming I understand it correctly) - >>> effectively equivalent to running N index scans through Merge Append. >>> >>> >>> My workarounds I have proposed users have been either to rewrite the >>> query as a UNION ALL of a set of single value prefix queries wrapped >>> in an order by limit. This gives the exact needed merge append plan >>> shape. But repeating the query N times can get unwieldy when the >>> number of values grows, so the fallback is: >>> >>> SELECT * FROM unnest(:friends) id, LATERAL ( >>> SELECT * FROM posts >>> WHERE user_id =3D id >>> ORDER BY tstamp DESC LIMIT 100) >>> ORDER BY tstamp DESC LIMIT 100; >>> >>> The downside of this formulation is that we still have to fetch a >>> batch worth of items from scans where we otherwise would have only had >>> to look at one index tuple. >>> >>> >>> GIST can be used to handle this kind of queries as it supports multiple >>> sort orders. >>> The only problem is that GIST does not support ORDER BY column. >>> One possible workaround is [1] but as described there it does not play >>> well with partitioning. >>> I=E2=80=99ve started drafting support for ORDER BY column in GIST - see= [2]. >>> I think it would be easier to implement and maintain than a new IAM (bu= t >>> I don=E2=80=99t have enough knowledge and experience to implement it my= self) >>> >>> [1] >>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC= 21F%40kleczek.org >>> [2] >>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5D= C91%40kleczek.org >>> >>> =E2=80=94 >>> Kind regards, >>> Michal >>> >> --0000000000006fb55f064b8503b1 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Hackers,

Do you= think that MERGE-SCAN was a terrible name? I wanted a name that wouldn'= ;t=C2=A0

requir= e much explanation. I named it like this because it relies on a k-way=C2=A0

merge = to combine several segments of an index in one result.=C2=A0But we already<= span class=3D"gmail-Apple-converted-space">=C2=A0

have a= MERGE statement. Even in the example plan above we can see an=C2=A0externa= l=C2=A0

merge = that has nothing to do with the new feature, and now as I am doing joins,=C2=A0

I star= ted doing it on the NestedLoop trying to follow the same conditions that=C2=A0

lead t= o a memoize. But I added so many fields to the NestedLoop state that I=C2=A0

think = it is good to have a separate=C2=A0structure, and maybe a separate node, an= d=C2=A0

MergeS= can of course is taken hehe. I was thinking of IndexPrefixMerge. We could=C2=A0

use th= e Ants nickname TimeLineScan, but of course it is not limited to time=C2=A0

lines = (even though realistically, that will probably be the most common use of=C2=A0

this).= Another one I considered was TransposedIndexScan, because it orders=C2=A0

output= on (suffix, prefix) instead of (prefix, suffix).


=



On Fri, Feb 6, 2026 at 10:52=E2=80= =AFAM Alexandre Felipe <= o.alexandre.felipe@gmail.com> wrote:
Hello again hackers!

+pt@bowt.ie: That seems to be = the one that is probably the most familiar with the index scan (based on th= e commits).
+michael@paquier.xyz=C2=A0,=C2=A0+tgl@sss= .pgh.pa.us=C2=A0,=C2=A0+peter@eisentraut.org=C2=A0as the top = 3 committers to nbtree over the last ~6 months.

I=C2=A0 have made substantial=C2=A0progress on adding a few features. I h= ave questions, but I will let you go first :)

Motivation:
In technical terms: this p= roposal is to take advantage of a btree index when the query is filtered by= a few distinct prefixes and ordered by a suffix and has a limit.
=
In non technical: This could help to efficiently render=C2=A0a = social network feed, where each user can select a list of users whose posts= they want to see, and the posts must be ordered from newest to oldest.

Performance Comparison
=

I did a test with a toy table, please find more details below.
<= br>With limit 100

| Method =C2=A0 =C2=A0 | = Shared Hit | Shared Read | Exec Time |
|------------|-----------:|------= ------:|----------:|
| Merge =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 = =C2=A0 13 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 119 | =C2=A0 =C2=A0 13 ms |
| In= dexScan =C2=A0| =C2=A0 =C2=A0 15,308 | =C2=A0 =C2=A0 525,310 | =C2=A03,409 = ms |


With limit 1,000,000

| M= ethod =C2=A0 =C2=A0 | =C2=A0SharedHit | SharRead | Temp I | Temp O | Exec T= ime |
|------------|-----------:|---------:|-------:|-------:|----------= :|
| Merge =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0980,318 | =C2=A0 19,721 | = =C2=A0 =C2=A0 =C2=A00 | =C2=A0 =C2=A0 =C2=A00 | =C2=A02,128 ms |
| Seque= ntial | =C2=A0 =C2=A0 15,208 | =C2=A0525,410 | 20,207 | 35,384 | =C2=A03,76= 2 ms |
| Bitmap =C2=A0 =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0629 | =C2=A01= 13,759 | 20,207 | 35,385 | =C2=A05,487 ms |
| IndexScan =C2=A0| =C2=A07,= 880,619 | =C2=A0126,706 | 20,945 | 35,386 | =C2=A05,874 ms |


= Sequential scans and bitmap scans in this case reduces significantly the nu= mber of
accessed buff because the table has only four integer columns, = and these methods
can read all the lines on a given page at a time.
=
However that comes at the cost of resorting to an in-disk sort method. =
For the query with limit 100 we get no temp files as we are using a
= top-100 sort.

make check passes


= Experiment details

Consider a 100M ro= w table formed (a,b,c,d) \in 100 x 100 x 100 x 100


```sql
CREATE TABLE grid AS (
=C2=A0 =C2=A0 SELECT a, b,= c, d, FROM
=C2=A0 =C2=A0 =C2=A0 =C2=A0 generate_series(1, 100) AS a, <= br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 generate_series(1, 100) AS b,
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 generate_series(1, 100) AS c,
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 generate_series(1, 100) AS d
);

CREATE INDEX grid_index ON= grid (a, b, c);
ANALYSE grid;
```

Now let's say th= at we need to find certain number of rows filtered by a and ordered by b;```sql
PREPARE grid_query(int) AS
SELECT s= um(d) FROM (
=C2=A0 =C2=A0 SELECT * FROM grid
=C2=A0 =C2=A0 WHERE a = IN (2,3,5,8,13,21,34,55) AND b >=3D 0
=C2=A0 =C2=A0 ORDER BY b
= =C2=A0 =C2=A0 LIMIT $1) t;

```

---


Now with lim= it 100, with index merge scan (notice Index Prefixes in the plan).

`= ``sql
SET enable_indexmergescan =3D on;
EXPLAIN (ANALYSE) EXECUTE gri= d_query(100);
```

```text
=C2=A0 =C2=A0Buffers: shared hit=3D1= 3 read=3D119
=C2=A0 =C2=A0-> =C2=A0Limit =C2=A0(cost=3D0.57..87.29 ro= ws=3D100 width=3D16) (actual time=3D5.528..12.999 rows=3D100.00 loops=3D1)<= br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D119=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Index Scan using grid_a_b_c_= idx on grid =C2=A0(cost=3D0.57..93.36 rows=3D107 width=3D16) (actual time= =3D5.528..12.994 rows=3D100.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: (b >=3D 0)
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Prefixes: (a =3D ANY (&#= 39;{2,3,5,8,13,21,34,55}'::integer[]))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 8
=C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D119
=C2=A0Pl= anning:
=C2=A0 =C2=A0Buffers: shared hit=3D59 read=3D23
=C2=A0Plannin= g Time: 4.619 ms
=C2=A0Execution Time: 13.055 ms
=C2=A0```

```sql
SET enable_indexmergescan =3D off;
EXPLAIN (ANALYSE) EXECUTE = grid_query(100);
```

```text
=C2=A0Aggregate =C2=A0(cost=3D160= 3588.06..1603588.07 rows=3D1 width=3D8) (actual time=3D3406.624..3408.710 r= ows=3D1.00 loops=3D1)
=C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D52= 5310
=C2=A0 =C2=A0-> =C2=A0Limit =C2=A0(cost=3D1603575.17..1603586.81= rows=3D100 width=3D16) (actual time=3D3406.601..3408.702 rows=3D100.00 loo= ps=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 re= ad=3D525310
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Gather Merge = =C2=A0(cost=3D1603575.17..2514342.92 rows=3D7819999 width=3D16) (actual tim= e=3D3406.598..3408.695 rows=3D100.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 2
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Launched: 2
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D52= 5310
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0= Sort =C2=A0(cost=3D1602575.14..1610720.98 rows=3D3258333 width=3D16) (actua= l time=3D3393.782..3393.784 rows=3D100.00 loops=3D3)
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0Sort Method: top-N heapsort =C2=A0Memory: 32kB
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared = hit=3D15308 read=3D525310
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Method: top-N heapsort = =C2=A0Memory: 32kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: top-N heapsort =C2=A0M= emory: 32kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0-> =C2=A0Parallel Seq Scan on grid =C2=A0(cost= =3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D0.944..3129.8= 96 rows=3D2666666.67 loops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Filter: ((b >= =3D 0) AND (a =3D ANY ('{2,3,5,8,13,21,34,55}'::integer[])))
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0Rows Removed by Filter: 30666667
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0Buffers: shared hit=3D15234 read=3D525310
=C2=A0Planning Time: 0.3= 70 ms
=C2=A0Execution Time: 3409.134 ms
=C2=A0```

Now queries = with limit 1,000,000

```sql
SET enable_indexmergescan =3D on;
= EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```

Query executed w= ith the proposed access method. Notice in the plan Index Prefixes and Index= Cond.
```text
=C2=A0 =C2=A0Buffers: shared hit=3D980318 read=3D19721=
=C2=A0 =C2=A0-> =C2=A0Limit =C2=A0(cost=3D0.57..867259.84 rows=3D100= 0000 width=3D16) (actual time=3D2.854..2103.438 rows=3D1000000.00 loops=3D1= )
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D980318 read=3D= 19721
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Index Scan using gri= d_a_b_c_idx on grid =C2=A0(cost=3D0.57..867265.91 rows=3D1000007 width=3D16= ) (actual time=3D2.852..2066.205 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: (b >=3D 0)
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Prefixes: (a =3D ANY ('{2,3,5,8,13,21,34,55}'::integer[]))
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 8
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D98031= 8 read=3D19721
=C2=A0Planning Time: 0.328 ms
=C2=A0Execution Time: 21= 27.811 ms
=C2=A0```

If we disable index_mergescan we naturally we= fall into a sequential scan.

```sql
SET enable_indexmergescan = =3D off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text<= br>=C2=A0 =C2=A0Buffers: shared hit=3D15208 read=3D525410, temp read=3D2020= 7 written=3D35384
=C2=A0 =C2=A0-> =C2=A0Limit =C2=A0(cost=3D1942895.6= 4..2059362.12 rows=3D1000000 width=3D16) (actual time=3D3467.012..3712.044 = rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: = shared hit=3D15208 read=3D525410, temp read=3D20207 written=3D35384
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Gather Merge =C2=A0(cost=3D19428= 95.64..2853663.39 rows=3D7819999 width=3D16) (actual time=3D3467.010..3671.= 220 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0Workers Planned: 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0Workers Launched: 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15208 read=3D525410, temp rea= d=3D20207 written=3D35384
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0-> =C2=A0Sort =C2=A0(cost=3D1941895.62..1950041.45 rows=3D3258= 333 width=3D16) (actual time=3D3455.852..3476.358 rows=3D334576.33 loops=3D= 3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0Sort Key: grid.b
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: external merge =C2=A0Disk: 4701= 6kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 w= ritten=3D35384
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Method: external merge =C2=A0Disk:= 46976kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: external merge =C2=A0Disk: 47000k= B
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0-> =C2=A0Parallel Seq Scan on grid =C2=A0(cost=3D0.00..1478= 044.00 rows=3D3258333 width=3D16) (actual time=3D2.789..2779.483 rows=3D266= 6666.67 loops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Filter: ((b >=3D 0) AND (a = =3D ANY ('{2,3,5,8,13,21,34,55}'::integer[])))
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0Rows Removed by Filter: 30666667
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: = shared hit=3D15134 read=3D525410
=C2=A0Planning Time: 0.332 ms
=C2=A0= Execution Time: 3761.866 ms
```
=C2=A0
If we disable sequential sc= ans, then we get a bitmap scan

```sql
SET enable_seqscan =3D off;=
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
=C2= =A0 =C2=A0Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=3D= 20207 written=3D35385
=C2=A0 =C2=A0-> =C2=A0Limit =C2=A0(cost=3D19981= 99.78..2114666.26 rows=3D1000000 width=3D16) (actual time=3D5170.456..5453.= 433 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffe= rs: shared hit=3D629 read=3D113759 written=3D2, temp read=3D20207 written= =3D35385
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Gather Merge =C2= =A0(cost=3D1998199.78..2908967.53 rows=3D7819999 width=3D16) (actual time= =3D5170.455..5413.254 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 2
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Launched: 2
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D629 read=3D1= 13759 written=3D2, temp read=3D20207 written=3D35385
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Sort =C2=A0(cost=3D199719= 9.75..2005345.59 rows=3D3258333 width=3D16) (actual time=3D5156.929..5177.5= 07 rows=3D334500.67 loops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b
=C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: extern= al merge =C2=A0Disk: 47032kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D629 read=3D113759 w= ritten=3D2, temp read=3D20207 written=3D35385
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Me= thod: external merge =C2=A0Disk: 47280kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: e= xternal merge =C2=A0Disk: 46680kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Parallel Bitmap Heap Scan= on grid =C2=A0(cost=3D107691.54..1533348.13 rows=3D3258333 width=3D16) (ac= tual time=3D299.891..4489.787 rows=3D2666666.67 loops=3D3)
=C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0Recheck Cond: ((a =3D ANY ('{2,3,5,8,13,21,34,55}'::integ= er[])) AND (b >=3D 0))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Rows Removed by Inde= x Recheck: 2410242
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Heap Blocks: exact=3D13100 = lossy=3D22639
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D615 read=3D1= 13759 written=3D2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Heap Blocks: e= xact=3D13077 lossy=3D22755
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Heap Bl= ocks: exact=3D13036 lossy=3D22421
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0B= itmap Index Scan on grid_a_b_c_idx =C2=A0(cost=3D0.00..105736.54 rows= =3D7820000 width=3D0) (actual time=3D297.651..297.651 rows=3D8000000.00 loo= ps=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: ((a =3D = ANY ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0))
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 7
=C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D7293 written= =3D2
=C2=A0Planning Time: 0.165 ms
=C2=A0Execution Time: 5487.213 ms<= br>```

If we disable bitmap scans we finally get an index scan
```sql
SET enable_bitmapscan =3D off;
EXPLAIN ANALYSE EXECUTE grid_= query(1000000);
```
```
=C2=A0 =C2=A0Buffers: shared hit=3D7883221= read=3D124111, temp read=3D20699 written=3D35385
=C2=A0 =C2=A0-> =C2= =A0Limit =C2=A0(cost=3D7201203.08..7317669.55 rows=3D1000000 width=3D16) (a= ctual time=3D4414.478..4674.400 rows=3D1000000.00 loops=3D1)
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D7883221 read=3D124111, temp r= ead=3D20699 written=3D35385
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2= =A0Gather Merge =C2=A0(cost=3D7201203.08..8111970.83 rows=3D7819999 width= =3D16) (actual time=3D4414.476..4633.982 rows=3D1000000.00 loops=3D1)
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 2=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Launched: = 2
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared= hit=3D7883221 read=3D124111, temp read=3D20699 written=3D35385
=C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Sort =C2=A0(cos= t=3D7200203.05..7208348.88 rows=3D3258333 width=3D16) (actual time=3D4390.6= 25..4411.896 rows=3D334567.00 loops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Meth= od: external merge =C2=A0Disk: 47304kB
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D7= 883221 read=3D124111, temp read=3D20699 written=3D35385
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2= =A0Sort Method: external merge =C2=A0Disk: 47304kB
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort= Method: external merge =C2=A0Disk: 46384kB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-> =C2=A0Parallel Ind= ex Scan using grid_a_b_c_idx on grid =C2=A0(cost=3D0.57..6736351.43 row= s=3D3258333 width=3D16) (actual time=3D46.925..3796.915 rows=3D2666666.67 l= oops=3D3)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: ((a =3D ANY ('{2,3,5,8,1= 3,21,34,55}'::integer[])) AND (b >=3D 0))
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0Index Searches: 7
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D78832= 08 read=3D124110
=C2=A0Planning Time: 0.385 ms
=C2=A0Execution Time: = 4713.325 ms
=C2=A0```

=C2=A0


<= /div>


On Thu, Feb 5, 2026 at 6:59=E2=80=AFAM Alexandre = Felipe <o.alexandre.felipe@gmail.com> wrote:
Thank you for looking into t= his.

Now we can execute a, still narrow, family queries!=

Maybe it helps to see this as a=C2=A0social ne= twork=C2=A0feeds. Imagine a social network, you have a few friends, or = follow a few people, and you want to see their updates ordered by date. For= each user we have a different combination of users that we have to display= . But maybe, even having hundreds of users we will only show the first 10.<= /div>

There is a low hanging fruit on the skip scan, if = we need N rows, and one group already has M rows we could stop there.
=
If Nx is the number of friends, and M is the number of posts to show.<= /div>
This runs with complexity (Nx * M) rows, followed by an (Nx * M) = sort, instead of (Nx * N) followed by an (Nx * N) sort.
Where M = =3D 10 and N is 1000 this is a significant improvement.
But if M = ~ N, the merge scan=C2=A0that runs with M=C2=A0+ Nx row accesses, (M=C2=A0+= Nx) heap operations.
If everything is on the same page the skip = scan would win.

The cost estimation is probably fa= r off.
I am also not considering the filters=C2=A0applied after t= his=C2=A0operator,=C2=A0and I don't know if the planner infrastructure = is able to adjust it by itself.
This is where I would like review= er's feedback. I think that the planner costs are something to be deter= mined experimentally.

Next I will make it slightly= more general handling
* More index columns: Index (a, b, s...) could su= pport WHERE a IN (...) ORDER BY b LIMIT N (ignoring s...)
* Multi-column= prefix: WHERE (a, b) IN (...) ORDER BY c
* Non-leading prefix: WHERE b = IN (...) AND a =3D const ORDER BY c on index (a, b, c)

=
---
Kind Regards,
Alexandre

On Wed, Feb 4, 20= 26 at 7:13=E2=80=AFAM Micha=C5=82 K=C5=82eczek <michal@kleczek.org> wrote:
=


On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wr= ote:

On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wro= te:
I'm also wondering how common is the t= argeted query pattern? How common
it is to have an IN condition on the l= eading column in an index, and
ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it is thetimeline view. Think of the social media timeline, showing posts from
= all followed accounts in timestamp order, returned in reasonably sized
b= atches. The naive SQL query will have to scan all posts from all
followe= d accounts and pass them through a top-N sort. When the total
number of = posts is much larger than the batch size this is much slower
than what i= s proposed here (assuming I understand it correctly) -
effectively equiv= alent to running N index scans through Merge Append.

My workarounds I have proposed us= ers have been either to rewrite the
query as a UNION ALL of a set of sin= gle value prefix queries wrapped
in an order by limit. This gives the ex= act needed merge append plan
shape. But repeating the query N times can = get unwieldy when the
number of values grows, so the fallback is:
SELECT * FROM unnest(:friends) id, LATERAL (
=C2=A0=C2=A0=C2=A0SELECT = * FROM posts
=C2=A0=C2=A0=C2=A0WHERE user_id =3D id
=C2=A0=C2=A0=C2= =A0ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMIT 100;
The downside of this formulation is that we still have to fetch a
batc= h worth of items from scans where we otherwise would have only had
to lo= ok at one index tuple.

GIST= can be used to handle this kind of queries as it supports multiple sort or= ders.
The only problem is that GIST does not support ORDER BY col= umn.
One possible workaround is [1] but as described there it doe= s not play well with partitioning.
I=E2=80=99ve started drafting = support for ORDER BY column in GIST - see [2].
I think it would b= e easier to implement and maintain than a new IAM (but I don=E2=80=99t have= enough knowledge and experience to implement it myself)
=E2=80=94
Kind regards,
Michal
--0000000000006fb55f064b8503b1--