From: Peter Geoghegan
Date: Mon, 27 Oct 2025 13:37:25 -0400
Subject: Re: Batching in executor
To: Tomas Vondra
Cc: Amit Langote, PostgreSQL-development
In-Reply-To: <68f3771f-91f5-4cb7-b1de-74d9abbf0b96@vondra.me>

On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra wrote:
> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due different goals.

I've been working on a new prototype enhancement to the index
prefetching patch. The new spinoff patch has index scans batch up
calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
return. This optimization is effective whenever an index scan returns
a contiguous group of TIDs that all point to the same heap page. We're
able to lock and unlock heap page buffers at the same point that
they're pinned and unpinned, which can dramatically decrease the
number of heap buffer locks acquired by index scans that return
contiguous TIDs (which is very common).

I find that pgbench SELECT variants with a predicate such as "WHERE
aid BETWEEN 1000 AND 1500" can have up to ~20% higher throughput, at
least in cases with low client counts (think 1 or 2 clients). These
are cases where everything can fit in shared buffers, so we're not
getting any benefit from I/O prefetching (in spite of the fact that
this is built on top of the index prefetching patchset).

It makes sense to put this in scope for the index prefetching work
because that work will already give code outside of an index AM
visibility into which group of TIDs needs to be read next. Right now
(on master) there is some trivial sense in which index AMs use their
own batches, but that's completely hidden from external callers.

> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls.
> The only AM-specific bit in the batch is "position", but that's used
> only when advancing to the next page, etc.

The major difficulty with my heap batching prototype is getting the
layering right (no surprises there). In some sense we're deliberately
sharing information across what we currently think of as different
layers of abstraction, in order to be able to "schedule" the work more
intelligently. There's a number of competing considerations.

I have invented a new concept of a heap batch that is orthogonal to
the existing concept of index batches. Right now each one is just an
array of HeapTuple structs that relates to exactly one group of
contiguous heap TIDs (i.e. if the index scan returns TIDs even a
little out of order, which is fairly common, the current prototype
patch cannot reorder the work).

Once a batch is prepared, calls to heapam_index_fetch_tuple just
return the next TID from the batch (until the next time we have to
return a TID pointing to some distinct heap block). In the case of
pgbench queries like the one I mentioned, we only need to call
LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
returned (not once per heap tuple returned).

Importantly, the new interface added by my new prototype spinoff patch
is higher level than the existing
table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
executor asks the table AM "give me the next heap TID in the current
scan direction", rather than asking "give me this heap TID". The
general idea is that the table AM has a direct understanding of
ordered index scans.

The advantage of this higher-level interface is that it gives the
table AM maximum freedom to reorder work. As I said already, we won't
do things like merge together logically noncontiguous accesses to the
same heap page into one physical access right now. But I think that
that should at least be enabled by this interface.
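
To make the bookkeeping concrete, here is a toy model of the grouping
described above. It is only a sketch in Python, not code from the
prototype patch: the names heap_batches and count_lock_acquisitions
are made up for illustration, a TID is modeled as a plain
(block, offset) pair, and the 61-tuples-per-page layout is assumed so
that it matches the pgbench figure quoted above. One "lock
acquisition" stands in for one LockBuffer/heap_hot_search_buffer call.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a heap TID: (heap block number, offset number).
TID = Tuple[int, int]


def heap_batches(tids: Iterable[TID]) -> Iterator[List[TID]]:
    """Yield runs of consecutive TIDs that point to the same heap block.

    TIDs are taken strictly in index order. Nothing is reordered, so a
    TID that goes back to an earlier block starts a new batch.
    """
    for _block, run in groupby(tids, key=lambda tid: tid[0]):
        yield list(run)


def count_lock_acquisitions(tids: Iterable[TID]) -> Tuple[int, int]:
    """Return (per-tuple count, batched count) of modeled lock acquisitions."""
    tids = list(tids)
    per_tuple = len(tids)                          # one lock per heap tuple
    batched = sum(1 for _ in heap_batches(tids))   # one lock per batch
    return per_tuple, batched


if __name__ == "__main__":
    # Three heap pages, 61 tuples each, returned in perfectly contiguous order.
    contiguous = [(block, off) for block in range(3) for off in range(1, 62)]
    print(count_lock_acquisitions(contiguous))

    # A scan that briefly visits block 1 and then returns to block 0.
    out_of_order = [(0, 1), (0, 2), (1, 1), (0, 3)]
    print([len(batch) for batch in heap_batches(out_of_order)])
```

In this model the contiguous case needs one acquisition per page
rather than one per tuple, while the out-of-order case shows the
limitation mentioned above: the second visit to block 0 is a separate
batch, because accesses are not merged across noncontiguous runs.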
The downside of this approach is that the table AM (not the executor
proper) is responsible for interfacing with the index AM layer. I
think that this can be generalized without very much code duplication
across table AMs. But it's hard.

> This patch does things differently. IIUC, each TAM may produce it's own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.

I think that the base index prefetching patch's current notion of
index-AM-wise batches can be kept quite separate from any table AM
batch concept that might be invented, either as part of what I'm
working on, or in Amit's patch. It probably wouldn't be terribly
difficult to get the new interface I've described to return heap
tuples in whatever batch format Amit comes up with. That only has a
benefit if it makes life easier for expression evaluation in higher
levels of the plan tree, but it might just make sense to always do it
that way. I doubt that adopting Amit's batch format will make life
much harder for the heap_hot_search_buffer-batching mechanism (at
least if it is generally understood that the new index scan interface
builds batches in Amit's format on a best-effort basis).

-- 
Peter Geoghegan