Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vLWny-002QuF-2o for pgsql-hackers@arkaria.postgresql.org; Wed, 19 Nov 2025 01:16:19 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vLWnw-009ch6-0Y for pgsql-hackers@arkaria.postgresql.org; Wed, 19 Nov 2025 01:16:16 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vLWnv-009cgx-2R for pgsql-hackers@lists.postgresql.org; Wed, 19 Nov 2025 01:16:16 +0000 Received: from mail-wr1-x434.google.com ([2a00:1450:4864:20::434]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vLWnn-000Frk-14 for pgsql-hackers@postgresql.org; Wed, 19 Nov 2025 01:16:14 +0000 Received: by mail-wr1-x434.google.com with SMTP id ffacd0b85a97d-42b427cda88so4388184f8f.0 for ; Tue, 18 Nov 2025 17:16:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1763514966; x=1764119766; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=zQg450cbZ19ySWQWKC/jPtMaCl+suVpPKRXvFmCC9bo=; b=eZlD5ffZI/y1ZWJTkU6rLCMV+k+hbcETA2pwhefLIjmeV1U29U03jQIa4rMrYHE3IS Ne5m9VyQggcLss0efhhxtpsTlcwPYb91xjv9UmBkppKzqF6qkTh+BQOfeZ19scp1mOOH haZP+xaxlG0XLCD9yc4mtQqk7Bf1m5prk6s9pMuaCC0+sjMqdyOtk373F/6pbUWY+RPY N3iEH44ykW+SCudmAcU7TqhYz3DYeYy2C2ogKb4fYTN/B9JndPILyq7aoq4BPGPU7ULy MY8X5swlErZkHh4+RMNL/V2bfZqLukZqq7i9hl9uECJVh6UhdVshXMcB7hIblVCWemq7 IyOw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1763514966; x=1764119766; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=zQg450cbZ19ySWQWKC/jPtMaCl+suVpPKRXvFmCC9bo=; b=QiWfRiFcFn7+PzES/ZZiQebEqp/TArH9twQ7FegFdZG4LFCKSs07d+Cihwu2n9ZXkV kbOcuXXcJzP9IG2r5qaDB1V3amTZDmPy/NSIb8RRiHss08amu7/LvxpP/mq+jTZH43V9 WlUdKCyBiI7fD+4za0VfNAtqUE62H21aKDulWUcoAZSAugje5jXpPoR8sIJz0ALGJ3tt YX7Q1NnUdnL1Zaqvct3AxVNoqOGzksQEj4McXyt6uUcU1hH0k/WtdHwZb7i46kdzCc1K nMLTWTJMG65LlMyDwtBdjxOoPkrSHijih4BtYPV7SxY2L+dfk9e6OzfAqI0uOHDeN7AA ctlg== X-Gm-Message-State: AOJu0Yy/zaz3MsDlKGP063EqA7Ag9A4KyN4DWcJCHNy0BTJ/gb24wGbB DZdRg3SijT6JzbrKTaffxO9tL1XDy0IinzRU2MrL08xfK9Z8FeFKHfzayDMN6epj7oj7GSGYWT/ MZol4fPMt7zcWXBGSyLcXdJHr1IcGmdIag7nTjT3eSQ== X-Gm-Gg: ASbGnctN4sS/2I053Vdd5j3PiU9gFW6k3ht7i3oLtk2h3hODOnSdqF2/dQxoQmN1gin zn5s/QjZ5OVIlPxjxVi09ejGKEK2IrAewR7otcV8M9TMKXmCqnDIcwhk9eKELYABOll26nqA5// 973lE4rNhOEYL7LTdWB8URzltBDvBt6/VnnBsG2QCZ/KbaB0LoDxi/+IDdaZyn8jr92xkgfdDVZ u/+jtmXFgZCJbxCZDGk4Mb5X+tp+XNR8GiIQ5kcUkkV3l2ptjDQFyc2CgFP/o7SJgONk84= X-Google-Smtp-Source: AGHT+IFWNz/uOfFvPkaK81+hQrIeLRaFT84FbrUpxys+Q//HXL1od67U4zfKOvNoEksiiAxGzTqDSdmJfCY8/2Z1ryQ= X-Received: by 2002:a05:6000:26c8:b0:42b:3ee9:4773 with SMTP id ffacd0b85a97d-42b59323452mr18190631f8f.7.1763514966003; Tue, 18 Nov 2025 17:16:06 -0800 (PST) MIME-Version: 1.0 References: <20251118162801.27cb13408cb3c9dd3a72cf7d@sraoss.co.jp> In-Reply-To: <20251118162801.27cb13408cb3c9dd3a72cf7d@sraoss.co.jp> From: Peter Geoghegan Date: Tue, 18 Nov 2025 20:15:40 -0500 X-Gm-Features: AWmQ_bnIihqlDGEGPM52799uj7qS1LY7WWJH1EZh4HDefbbEUCV4k1nlSE1AslY Message-ID: Subject: Re: Add a berief general comment on BTScanInsertData's nextkey and backward To: Yugo Nagata Cc: Pgsql Hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, Nov 18, 2025 at 2:28=E2=80=AFAM Yugo Nagata w= rote: > I've attached a patch that adds the following comment: > > + * nextkey determines how the scankey's boundary is interpreted, and bac= kward > + * indicates a backward scan. See comments in _bt_first for a more deta= iled > + * explanation of these fields. > > What do think? Seems reasonable to me. I wonder if we also do something about these existing _bt_binsrch comments: * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber * of the last key < given scankey, or last key <=3D given scankey if nextk= ey * is true. (Since _bt_compare treats the first data key of such a page as * minus infinity, there will be at least one key < scankey, so the result * always points at one of the keys on the page.) Here we describe what happens on an internal page. This is correct, but *seems* to contradict the higher level comments at the end of _bt_first. There is actually no contradiction between the two -- not when you understand that _bt_first describes the whole end-to-end process of calling _bt_search and then calling _bt_binsrch on the leaf page (not of calling _bt_binsrch once, against an internal page). One could think of this _bt_binsrch internal page behavior as compensating for the fact that internal pages have pivot tuples that consist of a separator key (except for the first key, which is just has a -inf key/no key), plus a downlink that points to the *next* page after that separator key one level down (except for the internal page high key, which has no downlink at all). Might be useful to say something like that instead. This is hard to explain without an example. Right now, an internal page might have pivot tuples that look like this: Separator: -inf, Downlink: 1 Separator: 'a', Downlink: 2 Separator: 'c', Downlink: 3 Separator: 'f', Downlink: (none, this is the high key) But _bt_binsrch "pretends" that our internal page actually contains: Downlink: 1 Separator: 'a' Downlink: 2 Separator: 'c' Downlink: 3 Separator: 'f' So if our =3D scan key is (say) 'c', we should descend using the downlink that points to block 2 (not the one that points to block 3, as might be expected from looking at the real physical representation consisting of pivot tuples). That is what the scan needs to do to get to the page one level down whose high key is also 'c' -- that's where the scan needs to look. (If the next level down is the leaf level, then the leaf page we descend to might also contain a *non*-pivot tuple with the key value 'c', "right before" the high key with 'c', which the scan will need to return in _bt_readpage. Lehman & Yao allow the key before a leaf page's high key to be equal to the high key, but otherwise forbid all duplicate keys.) I find it very hard to know what explanation will be the least confusing to other people, at least in this area. Since you're interested in this area, I thought I'd ask what you think. (Here I'm pretending that the keys in the tree are unique without needing a heap TID, per classic L&Y, which is unrealistic but simplifies the example.) -- Peter Geoghegan