Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vnOqE-006VAB-1K for pgsql-hackers@arkaria.postgresql.org; Tue, 03 Feb 2026 22:25:50 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vnOqD-007aU3-1J for pgsql-hackers@arkaria.postgresql.org; Tue, 03 Feb 2026 22:25:49 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vnOqC-007aTv-2t for pgsql-hackers@lists.postgresql.org; Tue, 03 Feb 2026 22:25:48 +0000 Received: from relay6-d.mail.gandi.net ([2001:4b98:dc4:8::226]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vnOq9-00000000QRi-3oJf for pgsql-hackers@postgresql.org; Tue, 03 Feb 2026 22:25:47 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 820D0443EF; Tue, 3 Feb 2026 22:25:37 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1770157538; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=itb87J6Ju8vUWr63lpht6LD4Uk+hdoI7qnb3xAwBKqg=; b=RAMKJQUE89uziwpFz/41q+r0MmEY6Tk7eGenzuIaAOXRhvK29yoOdGCamoKIde9ecsSBOI w7ECw/ZJTUa2BE5AZzpbzg+p2llwOiG7MEobkOLtDwYm8MuOfKIBOb/fEc3oCrl9OXPfqV smZ65usPBa2Yda9fOUn1DKBSjYnMFPCGi78vfwfDaYPCkUvqahnWXKmAHVVYwT3DM5OMvV cK0uop0S2pDh8HgV1FweZ+pwLjci6KzKqdaJJV0AWyibrKugxUiNKekWQ1v5v+0ZJUyBzN 8nKSpkN5/8Vlukjbjfwg1gq0eFZlMYo4IhL9eU30C7YWVsaY8YSfHSWwi7soUg== Message-ID: Date: Tue, 3 Feb 2026 23:25:36 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: New access method for b-tree. To: Matthias van de Meent Cc: Alexandre Felipe , "pgsql-hackers@postgresql.org" References: Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-GND-Sasl: tomas@vondra.me X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddukeduvdegucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefkffggfgfuvfevfhfhjggtgfesthejredttddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhepledugeeikefglefhgfffuedvleetteevgefhvdeikeefudduuddvhfevudefhfevnecukfhppeekiedrgeelrddvfedtrddvtdeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepkeeirdegledrvdeftddrvddtiedphhgvlhhopegluddtrddufeejrddtrddvngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhqihgupeekvddtffdtgeegfefghfdpmhhouggvpehsmhhtphhouhhtpdhnsggprhgtphhtthhopeefpdhrtghpthhtohepsghovghkvgifuhhrmhesghhmrghilhdrtghomhdprhgtphhtthhopegrlhgvgigrnhgurhgvrdhfvghlihhpvgesthhprhhordhiohdprhgtphhtthhopehpghhsqhhlqdhhrggtkhgvrhhssehpohhsthhgrhgvshhqlhdrohhrgh List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 2/3/26 17:01, Matthias van de Meent wrote: > On Mon, 2 Feb 2026 at 00:54, Tomas Vondra wrote: >> >> Hello Felipe, >> >> On 2/1/26 11:02, Alexandre Felipe wrote: >>> Hello Hackers, >>> >>> Please check this out, >>> >>> It is an access method to scan a table sorting by the second column of >>> an index, filtered on the first. >>> Queries like >>> SELECT x, y FROM grid >>> WHERE x in (array of Nx elements) >>> ORDER BY y, x >>> LIMIT M >>> >>> Can execute streaming the rows directly from disk instead of loading >>> everything. > > +1 for the idea, it does sound interesting. I haven't looked in depth > at the patch, so no comments on the execution yet. > >> So how does this compare to skip scan in practice? It's hard to compare, >> as the patch does not implement an actual access path, but I tried this: > [...] >> 1) master >> >> SELECT * FROM merge_test_int >> WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15) > [...] >> 2) merge scan >> >> SELECT * FROM test_btree_merge_scan_int( > [...] >> ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], > [...] >> And with explain analyze we get this: >> >> 1) Buffers: shared hit=26 read=25 >> 2) Buffers: shared hit=143 read=17 > > (FYI; your first query was missing "2" from it's IN list while it was > present in the merge scan input; this makes the difference worse by a > few pages) > >> So it seems to access many more buffers, even if the number of reads is >> lower. Presumably the merge scan is not always better than skip scan, >> probably depending on number of prefixes in the query etc. What is the >> cost model to decide between those two? > > Skip scan always returns data in index order, while this merge scan > would return tuples a suffix order. The cost model would thus weigh > the cost of sorting the result of an index skipscan against the cost > of doing a merge join on n_in_list_items distinct (partial) index > scans. > Makes sense. > As for when you would benefit in buffers accessed: The merge scan > would mainly benefit in number of buffers accessed when the selected > prefix values are non-sequential, and the prefixes cover multiple > pages at a time, and when there is a LIMIT clause on the scan. Normal > btree index skip scan infrastructure efficiently prevents new index > descents into the index when the selected SAOP key ranges are directly > adjecent, while merge scan would generally do at least one index > descent for each of its N scan heads (*) - which in the proposed > prototype patch guarantees O(index depth * num scan heads) buffer > accesses. > Do we have sufficient information to reliably make the right decision? Can we actually cost the two cases well enough? > (*) It is theoretically possible to reuse an earlier index descent if > the SAOP entry's key range of the last descent starts and ends on the > leaf page that the next SAOP entry's key range also starts on > (applying the ideas of 5bf748b86b to this new multi-headed index scan > mode), but that infrastructure doesn't seem to be in place in the > current patch. That commit is also why your buffer access count for > master is so low compared to the merge scan's; if your chosen list of > numbers was multiples of 5 (so that matching tuples are not all > sequential) you'd probably see much more comparable buffer access > counts. > >> If you had to construct the best case and worst cases (vs. skip scan), >> what would that look like? > > Presumably the best case would be: > > -- mytable.a has very few distinct values (e.g. bool or enum); > mytable.b many distinct values (e.g. uuid) > SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b; > > which the index's merge scan would turn into an index scan that > behaves similar to the following, possibly with the merge join pushed > down into the index: > > SELECT * FROM ( > SELECT ... FROM mytable WHERE a = 1 > UNION > SELECT ... FROM mytable WHERE a = 2 > ) ORDER BY b. > > > The worst case would be the opposite: > > -- mytable.a has many distinct values (random uuid); mytable.b few > (e.g. boolean; enum) > SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b > > As the merge scan maintains one internal indexscan head per SAOP array > element, it'd have significant in-memory and scan startup overhead, > while few values are produced for each of those scan heads. > OK. It'll be interesting to see how this performs in practice for the whole gamut between the best and worst case. >> I'm also wondering how common is the targeted query pattern? How common >> it is to have an IN condition on the leading column in an index, and >> ORDER BY on the second one? > > I'm not sure, but it seems like it might be common/useful in > queue-like access patterns: > > With an index on (state, updated_timestamp) you're probably interested > in all messages in just a subset of states, ordered by recent state > transitions. An index on (updated_timestamp, state) might be > considered more optimal, but won't be able to efficiently serve > queries that only want data on uncommon states: The leaf pages would > mainly contain data on common states, reducing the value of those leaf > pages. > > Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query > using UNION, or add an index for each percievable IN list, but it'd be > great if the user didn't have to rewrite their query or create > n_combinations indexes with their respective space usage to get this > more efficient query execution. > I think the examples presented by Ants (with timeline view) are quite plausible in practice. regards -- Tomas Vondra