Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vnIqj-004ruG-2x for pgsql-hackers@arkaria.postgresql.org; Tue, 03 Feb 2026 16:01:58 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vnIqi-005b8K-37 for pgsql-hackers@arkaria.postgresql.org; Tue, 03 Feb 2026 16:01:56 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vnIqi-005b8C-1i for pgsql-hackers@lists.postgresql.org; Tue, 03 Feb 2026 16:01:56 +0000 Received: from mail-lj1-x232.google.com ([2a00:1450:4864:20::232]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vnIqg-00000000Ngp-28Ls for pgsql-hackers@postgresql.org; Tue, 03 Feb 2026 16:01:55 +0000 Received: by mail-lj1-x232.google.com with SMTP id 38308e7fff4ca-38316d0c26eso50286981fa.2 for ; Tue, 03 Feb 2026 08:01:54 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1770134512; cv=none; d=google.com; s=arc-20240605; b=dB/ctia1Xph3PHiie4uMeqBfiP8xJnud5Rr4B2QlTZ42v8qeJ/RVrMkzXmoW0GORha gDqF9PBMAeGwqz5cqYbompRuP9QvvE4TO99n5xGLX6hiIlhscpqzcL4l2NeDgu7KfwxD ZN8Znrf4XTyC6sAp4BvwrJoC38S8IW+sju+SrGyVKKH2/zrrEXJmwzmWqifqvnpuNRgA splBS7o/XFrzJOwpEaj6o+xTXcUAFeckBgjDrkm1JLL28Td092rw4gOwCvQtuefGfujS xiO65zju9EeT5oG4zniqJXR93Riw9d+RbxG4Nweq0euKTuqz46GR5j77KQ5iIfVoMG16 DL4w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=YDml6z6CfZ6l1o7gYNHdimKJsjSVEMXaCX9MsYZjYcM=; fh=Ib9TvLqbT1qRCnBf6PprLwoUDh7ruNcj06O1BsGokgs=; b=EbnEOeoTs0fOalwzKJy9N4CSnPW4EQDULEIN2SH1fyUh8P+9kCDjqp3o35Siwxg0H+ tZFKhBFHY7sC2tpZfAjuXMlqW5TiJu7RA+FVRBQ00o7dLr8iePl8xvi9wOdBerRV1iE9 N0j9p12L1Qyx7NxfZbHx02DTZR+CR/9nuSk4ZzkK/XDjsQdZyOTEcSLqGBYW4gc2H6ya nywhob/Id7QxEOMqADgpn3IS53azkgsH12Y4tu3gI97rgXRxd+60JTABEJAhYSwZONWR tWYoDHhzjNYGRaIk0q/5Fb2FVQ2CBeWZo0TBKOnT0bK/Q2fOKQyQbnueZOZ4DC2Y9DzR uPiA==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1770134512; x=1770739312; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=YDml6z6CfZ6l1o7gYNHdimKJsjSVEMXaCX9MsYZjYcM=; b=J7qSbJH5fFt4KwW6cwqNgahuZeCLYzaPcj7lcqL5fhzjmBkCR4UXNKfwcpXMsq3Y5a YQfq59d7U+KXiCfGRAt6Tynv9X+Z+UssCaCyDqW0h1vzoNuZlUwmMTpIVu+iJ3vySldj QQyrTvBJ7r4nXyUHhLCP7YSpeLM1rA2Biz42Dox20crjmVGBp3Z9+/wymzuttu1xp/WD 9XPgACTv2z/YG+rpMRslf1hgvUk2QGk7Muzu0W6/bTni7FaYubXUjk7PbsCqdTatEy3a c4Y3k9iBtvkcl4DBd3KECEuD+/wZB6RkyhSiiJhKAH3uK5GRPOgYFLpwW/n6RjBqHi1Q F6Vw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1770134512; x=1770739312; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=YDml6z6CfZ6l1o7gYNHdimKJsjSVEMXaCX9MsYZjYcM=; b=RIZC5/YzsFzp/GbLX8I+vi7y5kvbURkCfTftgYmrdtruh7mgSvhv5Yai/irIru6gFB VsKMwJEAK9bTnfNfsnk7c+xl8d14URy0gJKjxlsrEPFny21e5pNwHsNVzy4BBtXPCUDM pP5T4oRC930VLrcsTlYC7JDQDP+U5Wz+G0Jm77EFm0UVZaUVlAKRqnPWHEGWM0vrL7G8 NLBJUA8d574+IXWFQneS3pHrWDUm5hkC+XUVc3Cxai+u24JMr/TLZAUD4JtF1OeNbSJA Yxwkaoxzvi7bPMUcaElG2+f5u3uWnCpo/RKfGDQcjboAQsZCwU33UtgRPhtPwCVvVbRb NlFw== X-Forwarded-Encrypted: i=1; AJvYcCVH4yr9NJR0i8LObDJOnp7VpCGg5Wg2Mxfprb2abk48Lp7XIi0/WUto79Viny+Dgr1RGf9LHVloDirHeIUI@postgresql.org X-Gm-Message-State: AOJu0Ywksn1OfhYllRiUQceafbQ6DYtfU9aRezxmCd78e8rrtB/QxMDW gCyvVooeZ6x3NHG6CBmVsT6XnSBnU37Ukxutb7iKQFYMKCr/bQSreyFFPgNmPppomkvTxjeeUER 04OuBtmA2K2gKbaoR9F71q3P4+AQ2pxQ= X-Gm-Gg: AZuq6aKKUGhp2E6osMKPTVfemcyLOGUwthhnmAv+ziYLUqba3Alxq+Rk7Go9zWgliCu CPaFwqmKV4b6v2hrmSzYrEXXF70a0WQsCHhz2q9cEwzED5WXfK47ICPpcbA0gluZ8nnbVHrDZ/4 AJAGYP2jvt3Z2n6jhPYY34BkmhmQb+VKvm1I7RH9p95iewm1/S50OJxYEOuddYhuRDml0qXurV1 y+y5m8F9Rhspnk+EFnv0Hbl4mT/+ZCBQgc93W2AWKRBSofzH3lIkB5rJY2gIUHDFQyZ5aJRKj95 qYg706ytosqzXyV1vw6J90PeiTraCxB7a5FwZZW64w7TWcsjMPNP/U92noofDxgga/4mIjHUqhW XayM= X-Received: by 2002:a2e:a54b:0:b0:383:5c1b:4973 with SMTP id 38308e7fff4ca-38691c6d9e5mr652581fa.2.1770134511306; Tue, 03 Feb 2026 08:01:51 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Matthias van de Meent Date: Tue, 3 Feb 2026 17:01:39 +0100 X-Gm-Features: AZwV_Qh1kLDxoSNWvx5s6riiQCI3AOY0Cu0eFAnF3iFjYS8osge4iDZkBI76Mcc Message-ID: Subject: Re: New access method for b-tree. To: Tomas Vondra Cc: Alexandre Felipe , "pgsql-hackers@postgresql.org" Content-Type: text/plain; charset="UTF-8" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Mon, 2 Feb 2026 at 00:54, Tomas Vondra wrote: > > Hello Felipe, > > On 2/1/26 11:02, Alexandre Felipe wrote: > > Hello Hackers, > > > > Please check this out, > > > > It is an access method to scan a table sorting by the second column of > > an index, filtered on the first. > > Queries like > > SELECT x, y FROM grid > > WHERE x in (array of Nx elements) > > ORDER BY y, x > > LIMIT M > > > > Can execute streaming the rows directly from disk instead of loading > > everything. +1 for the idea, it does sound interesting. I haven't looked in depth at the patch, so no comments on the execution yet. > So how does this compare to skip scan in practice? It's hard to compare, > as the patch does not implement an actual access path, but I tried this: [...] > 1) master > > SELECT * FROM merge_test_int > WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15) [...] > 2) merge scan > > SELECT * FROM test_btree_merge_scan_int( [...] > ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], [...] > And with explain analyze we get this: > > 1) Buffers: shared hit=26 read=25 > 2) Buffers: shared hit=143 read=17 (FYI; your first query was missing "2" from it's IN list while it was present in the merge scan input; this makes the difference worse by a few pages) > So it seems to access many more buffers, even if the number of reads is > lower. Presumably the merge scan is not always better than skip scan, > probably depending on number of prefixes in the query etc. What is the > cost model to decide between those two? Skip scan always returns data in index order, while this merge scan would return tuples a suffix order. The cost model would thus weigh the cost of sorting the result of an index skipscan against the cost of doing a merge join on n_in_list_items distinct (partial) index scans. As for when you would benefit in buffers accessed: The merge scan would mainly benefit in number of buffers accessed when the selected prefix values are non-sequential, and the prefixes cover multiple pages at a time, and when there is a LIMIT clause on the scan. Normal btree index skip scan infrastructure efficiently prevents new index descents into the index when the selected SAOP key ranges are directly adjecent, while merge scan would generally do at least one index descent for each of its N scan heads (*) - which in the proposed prototype patch guarantees O(index depth * num scan heads) buffer accesses. (*) It is theoretically possible to reuse an earlier index descent if the SAOP entry's key range of the last descent starts and ends on the leaf page that the next SAOP entry's key range also starts on (applying the ideas of 5bf748b86b to this new multi-headed index scan mode), but that infrastructure doesn't seem to be in place in the current patch. That commit is also why your buffer access count for master is so low compared to the merge scan's; if your chosen list of numbers was multiples of 5 (so that matching tuples are not all sequential) you'd probably see much more comparable buffer access counts. > If you had to construct the best case and worst cases (vs. skip scan), > what would that look like? Presumably the best case would be: -- mytable.a has very few distinct values (e.g. bool or enum); mytable.b many distinct values (e.g. uuid) SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b; which the index's merge scan would turn into an index scan that behaves similar to the following, possibly with the merge join pushed down into the index: SELECT * FROM ( SELECT ... FROM mytable WHERE a = 1 UNION SELECT ... FROM mytable WHERE a = 2 ) ORDER BY b. The worst case would be the opposite: -- mytable.a has many distinct values (random uuid); mytable.b few (e.g. boolean; enum) SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b As the merge scan maintains one internal indexscan head per SAOP array element, it'd have significant in-memory and scan startup overhead, while few values are produced for each of those scan heads. > I'm also wondering how common is the targeted query pattern? How common > it is to have an IN condition on the leading column in an index, and > ORDER BY on the second one? I'm not sure, but it seems like it might be common/useful in queue-like access patterns: With an index on (state, updated_timestamp) you're probably interested in all messages in just a subset of states, ordered by recent state transitions. An index on (updated_timestamp, state) might be considered more optimal, but won't be able to efficiently serve queries that only want data on uncommon states: The leaf pages would mainly contain data on common states, reducing the value of those leaf pages. Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query using UNION, or add an index for each percievable IN list, but it'd be great if the user didn't have to rewrite their query or create n_combinations indexes with their respective space usage to get this more efficient query execution. Kind regards, Matthias van de Meent Databricks (https://www.databricks.com)