Subject: Re: New Table Access Methods for Multi and Single Inserts
To: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Cc: Justin Pryzby <pryzby@telsasoft.com>,
 PostgreSQL-development <pgsql-hackers@postgresql.org>,
 Andres Freund <andres@anarazel.de>, Paul Guo <guopa@vmware.com>,
 Jeff Davis <pgsql@j-davis.com>
References: 
 <CALj2ACVi9eTRYR=gdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ@mail.gmail.com>
 <20201217050522.GU30237@telsasoft.com>
 <CALj2ACVgT1iocd5nQ+rEmqt3xcCONkR037qbc8PiojdR39Ag=w@mail.gmail.com>
 <20201217204442.GX30237@telsasoft.com>
 <CALj2ACW3BC5kgdffZ2LD_CT2wQoXVc29kGB74SVWnGZ=UFqcAQ@mail.gmail.com>
 <20201218175439.GA30237@telsasoft.com> <20201221074725.GF30237@telsasoft.com>
 <CALj2ACWMnZZCu=G0PJkEeYYicKeuJ-X=SU19i6vQ1+=uXz8u0Q@mail.gmail.com>
 <20201225023958.GW30237@telsasoft.com>
 <CALj2ACVDtYYRYD2SC+X2ALOUkhnUcgC7RLxiEYVWW2HxxrfRww@mail.gmail.com>
 <96eaa813-4ad6-e80a-04a4-cc8082d356dd@swarm64.com>
 <CALj2ACVsiAZMsP8p5MPg6SSEtoMFFaiAa6j2AFtEQJDhfbgs3Q@mail.gmail.com>
 <508af801-6356-d36b-1867-011ac6df8f55@swarm64.com>
 <CALj2ACUmL3+xLFtVbdcNpo_=ubdi=_nsp6MNq__xWwL=NGkdgA@mail.gmail.com>
From: Luc Vlaming <luc@swarm64.com>
Message-ID: <c2ba9c52-8c5a-13f0-7d50-b569f7c93840@swarm64.com>
Date: Tue, 12 Jan 2021 09:03:33 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.5.0
In-Reply-To: 
 <CALj2ACUmL3+xLFtVbdcNpo_=ubdi=_nsp6MNq__xWwL=NGkdgA@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
Received: from [10.1.0.5] (213.61.129.43) by
 LEJPR01CA0051.DEUPRD01.PROD.OUTLOOK.DE (2a01:4180:c012:1::19) with Microsoft
 SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.3763.2 via Frontend
 Transport; Tue, 12 Jan 2021 08:03:35 +0000
X-MS-PublicTrafficType: Email
X-MS-Office365-Filtering-Correlation-Id: d5eb569a-9ed4-42fe-bda1-08d8b6d09029
X-MS-TrafficTypeDiagnostic: FRXPR01MB0872:
X-Microsoft-Antispam-PRVS: 
	<FRXPR01MB0872177417A5DF0C1D9D9103B7AA0@FRXPR01MB0872.DEUPRD01.PROD.OUTLOOK.DE>
X-MS-Oob-TLC-OOBClassifiers: OLM:10000;
X-MS-Exchange-SenderADCheck: 1
X-Microsoft-Antispam: BCL:0;
X-Microsoft-Antispam-Message-Info: 
	Gw5xrDzNH1hYU35IM1DO2y9sWqUvuK+Fu0iE4Mz9SM/vQ6h2kMBfbKqNWVu4DFCXujjzPU9VLqciNcV5vxHWcPqeR7CDiSwbEG9e0R53FWl4bTDYd/hDSdNKj1eMFzb5OEiav4NUEv0ro6g5lggAqjSimUCUEJIEHYn7ofFZHh/DCzRn8jsPSpc6JtZbCJWyrX/X3TGdFUJqE7QEtPqLjDqQXb2JcOyIKqMlgiWSoHGpvpWOVQ54arJ8O8HVKPqgmXSy3mE9QF1wUI6TyegGXtmWXWusQgKDqE27EcSlbdT8aQr+vfttr+E5goAus+ygH9rYkpzN8O5O3V8YTjZp0mHL1lQzwtP9BhI3v8+SfCV7aO+PZ7ikSZw/rZkQKmNld0kl6LR+6Mb/Z7AucXkbVoIXGVArEo/glmY6yyPR3oRj7zQOCFz/9RSshsPYR4hJQJeEeud6Ev7f4+JDr919Va8gI04e6tw35kRQR3BdMfw=
X-Forefront-Antispam-Report: 
	CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:FRXPR01MB0456.DEUPRD01.PROD.OUTLOOK.DE;PTR:;CAT:NONE;SFS:(366004)(136003)(376002)(39830400003)(346002)(396003)(54906003)(16576012)(5660300002)(83380400001)(956004)(8936002)(2616005)(66556008)(8676002)(86362001)(966005)(52116002)(55236004)(26005)(36756003)(16526019)(31686004)(186003)(53546011)(316002)(2906002)(31696002)(478600001)(66946007)(6916009)(4326008)(66476007)(43740500002);DIR:OUT;SFP:1101;
X-MS-Exchange-AntiSpam-MessageData: 
	=?utf-8?B?Q2hnekNHSFFheHJxdGk2OXJRbmRPWldGd091ZVZmM3pIUjZ3ck1YMksydWR4?=
 =?utf-8?B?TUtiZTNtK1FHWExuak5vTDAzZFJYYmthdGpvOGNBUFdlZ1RPSmhRSHFsOGVP?=
 =?utf-8?B?cnVHQmtRZ0NSaTdSaXlxTy9Vb0tMN0Z3UXU5bS9hM3lHMEgxTFhmdWhRYkdW?=
 =?utf-8?B?TXA0VW12M2dGMjUydDY0aEJwMG1TQjR6NXREajd2OXVRcmRvUUF3VnkrN3Bk?=
 =?utf-8?B?WjFQWFdUbm5IVjVqRFVFU2Nrb3o2Q1dXTUtKMVZtTitIQWQ5VVhrcFhHZkk2?=
 =?utf-8?B?anVLZVhYRDAyQ0dxK0RoaHh1VE5aRXd0aU1wZHhLbXkrRWltTHpreEdiUmFv?=
 =?utf-8?B?bEV2d2dUQ0Vtb21WK1J2azYwdEZway81QVd5d0JmMUV3OGNkOG12NnlHSmt6?=
 =?utf-8?B?T0Y5bnRHdEZ2Y2VOSG9FQU1PalB0cEI4MUF4dXE5SE9DMXl4aURBVXdtV3JQ?=
 =?utf-8?B?VzVxd1FvWHB3dGR0OXZCTitrZWRpcUdoN0JjcEpmNWRCT2o4ZEVCS1V3d3FC?=
 =?utf-8?B?NTExQzVNNURmME5SaGlwbXl6ZGRxYnE2TUFURmxlSnJOVjhoaFphWWI5QUMv?=
 =?utf-8?B?azRsYndPNWlINlRMeTFsUXA2anJlV2ZTK0hrRDJvMXRMTy9kU3o4d1FQWjFk?=
 =?utf-8?B?WC8yaXdRdUpHdDFOMmI0ZlNYai8wK1NEa1g5Qm1RdEh1amxRZG96cTFlQW41?=
 =?utf-8?B?eVJnVytHdDhoM1JkRFpiZFlOcjdVdTEzcHdRK2dNOXhHZ1M4S20weDlUVDdp?=
 =?utf-8?B?SUh6dHk3czJ5ZGs4UnVURmI0QmY1bDgxOFFEWUwxTDdmWVJsSUhnWWxsUTZs?=
 =?utf-8?B?eHJReVB4bkg5RUFicE9vMWM5RklvcW5sMUkrc0p6UVRxOWtUckdPenRGU3Zr?=
 =?utf-8?B?OVlMTzRRQkxFZFN5bEhSMmExdnN6VGw2MUYzRUwzeDVCUmxBSXg0RnYrUFVX?=
 =?utf-8?B?TExwL2RrZ3NRbm5qOENQTmYvM2l0SXZZdDhuQk1SUmFNbm9oRmVTYSt1MnBG?=
 =?utf-8?B?L2pORnVkb2pIK0JLMWEzOHluZXBpM0dxQjh4d2hZaVFOMEllcGdhNDFWRzJx?=
 =?utf-8?B?b3NxV3NobUR4cnRQNVZYTHhYbHB4NCtaanhxbzFFNUxBdmorRk1hckI3eTRT?=
 =?utf-8?B?OUMxdDY0ZFBJZzlPQ0NXSVorWmhTM1VseXFRUnpsNENmOVd2WEtOUXZPdmdr?=
 =?utf-8?B?UmJneDB1bThMay9wNGtnQTQxWi9HSjE3bXJIeXYxU2RERFV3Y1dpcmhlUWdn?=
 =?utf-8?B?a3BGa1FkanYwMVo5enU1enlrd3VidXljSTg5anZDcjkvQWhZNFJaKzdadkxE?=
 =?utf-8?Q?hxF1lIH8SkWwfhlR6xeurD/fDYUE2UCPdW?=
X-OriginatorOrg: swarm64.com
X-MS-Exchange-CrossTenant-AuthSource: FRXPR01MB0456.DEUPRD01.PROD.OUTLOOK.DE
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 12 Jan 2021 08:03:36.0232
 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 6a3a733d-0e05-4774-8f9e-4fc6a12face6
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 d5eb569a-9ed4-42fe-bda1-08d8b6d09029
X-MS-Exchange-CrossTenant-MailboxType: HOSTED
X-MS-Exchange-CrossTenant-UserPrincipalName: 
 A7j1LlBE/hW9LC0TI6frp4foHHuRu80lbw/qeETruEW5IN5stbv1k1jLugrF1LyK
X-MS-Exchange-Transport-CrossTenantHeadersStamped: FRXPR01MB0872
List-Id: <pgsql-hackers.lists.postgresql.org>
List-Help: <https://lists.postgresql.org/manage/>
List-Subscribe: <https://lists.postgresql.org/manage/>
List-Post: <mailto:pgsql-hackers@lists.postgresql.org>
List-Owner: <mailto:pgsql-hackers-owner@lists.postgresql.org>
List-Archive: <https://www.postgresql.org/list/pgsql-hackers>
Precedence: bulk

On 06-01-2021 14:06, Bharath Rupireddy wrote:
> On Wed, Jan 6, 2021 at 12:56 PM Luc Vlaming <luc@swarm64.com> wrote:
>> The main reason for me for wanting a single API is that I would like the
>> decision of using single or multi inserts to move to inside the tableam.
>> For e.g. a heap insert we might want to put the threshold at e.g. 100
>> rows so that the overhead of buffering the tuples is actually
>> compensated. For other tableam this logic might also be quite different,
>> and I think therefore that it shouldn't be e.g. COPY or CTAS deciding
>> whether or not multi inserts should be used. Because otherwise the thing
>> we'll get is that there will be tableams that will ignore this flag and
>> do their own thing anyway. I'd rather have an API that gives all
>> necessary information to the tableam and then make the tableam do "the
>> right thing".
>>
>> Another reason I'm suggesting this API is that I would expect that the
>> begin is called in a different place in the code for the (multiple)
>> inserts than the actual insert statement.
>> To me conceptually the begin and end are like e.g. the executor begin
>> and end: you prepare the inserts with the knowledge you have at that
>> point. I assumed (wrongly?) that during the start of the statement one
>> knows best how many rows are coming; and then the actual insertion of
>> the row doesn't have to deal anymore with multi/single inserts, choosing
>> when to buffer or not, because that information has already been given
>> during the initial phase. One of the reasons this is appealing to me is
>> that e.g. in [1] there was discussion on when to switch to a multi
>> insert state, and imo this should be up to the tableam.
> 
> Agree that whether to go with the multi or single inserts should be
> completely left to tableam implementation, we, as callers of those API
> just need to inform whether we expect single or multiple rows, and it
> should be left to tableam implementation whether to actually go with
> buffering or single inserts. ISTM that it's an elegant way of making
> the API generic and abstracting everything from the callers. What I
> wonder is how can we know in advance the expected row count that we
> need to pass in to heap_insert_begin()? IIUC, we can not estimate the
> upcoming rows in COPY, Insert Into Select, or Refresh Mat View or some
> other insert queries?  Of course, we can look at the planner's
> estimated row count for the selects in COPY, Insert Into Select or
> Refresh Mat View after the planning, but to me that's not something we
> can depend on and pass in the row count to the insert APIs.
> 
> When we don't know the expected row count, why can't we(as callers of
> the APIs) tell the APIs something like, "I'm intending to perform
> multi inserts, so if possible and if you have a mechanism to buffer
> the slots, do it, otherwise insert the tuples one by one, or else do
> whatever you want to do with the tuples I give it you". So, in case of
> COPY we can ask the API for multi inserts and call heap_insert_begin()
> and heap_insert_v2().
> 

I thought that when it is available (because of planning) it would be 
nice to pass it in. If you don't know you could pass in a 1 for doing 
single inserts, and e.g. -1 or max-int for streaming. The reason I 
proposed it is so that tableam's have as much knowledge as posisble to 
do the right thing. is_multi does also work of course but is just 
somewhat less informative.

What to me seemed somewhat counterintuitive is that with the proposed 
API it is possible to say is_multi=true and then still call 
heap_insert_v2 to do a single insert.

> Given the above explanation, I still feel bool is_multi would suffice.
> 
> Thoughts?
> 
> On dynamically, switching from single to multi inserts, this can be
> done by heap_insert_v2 itself. The way I think it's possible is that,
> say we have some threshold row count 1000(can be a macro)  after
> inserting those many tuples, heap_insert_v2 can switch to buffering
> mode.

For that I thought it'd be good to use the expected row count, but yeah 
dynamically switching also works and might work better if the expected 
row counts are usually off.

> 
> Thoughts?
> 
>> Which would make the code something like:
>>
>> void
>> heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
>> {
>>          TupleTableSlot  *batchslot;
>>          HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
>>          Size sz;
>>
>>          Assert(mistate && mistate->slots);
>>
>>          if (mistate->slots[mistate->cur_slots] == NULL)
>>                  mistate->slots[mistate->cur_slots] =
>>                                                                          table_slot_create(state->rel, NULL);
>>
>>          batchslot = mistate->slots[mistate->cur_slots];
>>
>>          ExecClearTuple(batchslot);
>>          ExecCopySlot(batchslot, slot);
>>
>>          /*
>>           * Calculate the tuple size after the original slot is copied, because the
>>           * copied slot type and the tuple size may change.
>>           */
>>          sz = GetTupleSize(batchslot, mistate->max_size);
>>
>>          Assert(sz > 0);
>>
>>          mistate->cur_slots++;
>>          mistate->cur_size += sz;
>>
>>          if (mistate->cur_slots >= mistate->max_slots ||
>>                  mistate->cur_size >= mistate->max_size)
>>                  heap_multi_insert_flush(state);
>> }
> 
> I think clearing tuples before copying the slot as you suggested may
> work without the need of clear_slots flag.

ok, cool :)

> 
>>
>>>   > Also, why do we want to do ExecClearTuple() anyway? Isn't
>>>   > it good enough that the next call to ExecCopySlot will effectively clear
>>>   > it out?
>>>
>>> For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the
>>> slot before copying. But, for buffer heap slots, the
>>> tts_buffer_heap_copyslot does not always clear the destination slot, see
>>> below. If we fall into else condition, we might get some issues. And
>>> also note that, once the slot is cleared in ExecClearTuple, it will not
>>> be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be
>>> false. That is why, let's have ExecClearTuple as is.
>>>
>> I had no idea the buffer heap slot doesn't unconditionally clear out the
>> slot :( So yes lets call it unconditionally ourselves. See also
>> suggestion above.
> 
> Yeah, we will clear the tuple slot before copy to be on the safer side.
> 

ok

>>>       /*
>>>        * If the source slot is of a different kind, or is a buffer slot
>>> that has
>>>        * been materialized / is virtual, make a new copy of the tuple.
>>> Otherwise
>>>        * make a new reference to the in-buffer tuple.
>>>        */
>>>       if (dstslot->tts_ops != srcslot->tts_ops ||
>>>           TTS_SHOULDFREE(srcslot) ||
>>>           !bsrcslot->base.tuple)
>>>       {
>>>           MemoryContext oldContext;
>>>
>>>           ExecClearTuple(dstslot);
>>>       }
>>>       else
>>>       {
>>>           Assert(BufferIsValid(bsrcslot->buffer));
>>>
>>>           tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple,
>>>                                       bsrcslot->buffer, false);
>>>
>>>   > - flushed -> why is this a stored boolean? isn't this indirectly encoded
>>>   > by cur_slots/cur_size == 0?
>>>
>>> Note that cur_slots is in HeapMultiInsertState and outside of the new
>>> APIs i.e. in TableInsertState, mistate is a void pointer, and we can't
>>> really access the cur_slots. I mean, we can access but we need to be
>>> dereferencing using the tableam kind. Instead of  doing all of that, to
>>> keep the API cleaner, I chose to have a boolean in the TableInsertState
>>> which we can see and use outside of the new APIs. Hope that's fine.
>>>
>> So you mean the flushed variable is actually there to tell the user of
>> the API that they are supposed to call flush before end? Why can't the
>> end call flush itself then? I guess I completely misunderstood the
>> purpose of table_multi_insert_flush being public. I had assumed it is
>> there to from the usage site indicate that now would be a good time to
>> flush, e.g. because of a statement ending or something. I had not
>> understood this is a requirement that its always required to do
>> table_multi_insert_flush + table_insert_end.
>> IMHO I would hide this from the callee, given that you would only really
>> call flush yourself when you immediately after would call end, or are
>> there other cases where one would be required to explicitly call flush?
> 
> We need to know outside the multi_insert API whether the buffered
> slots in case of multi inserts are flushed. Reason is that if we have
> indexes or after row triggers, currently we call ExecInsertIndexTuples
> or ExecARInsertTriggers on the buffered slots outside the API in a
> loop after the flush.
> 
> If we agree on removing heap_multi_insert_v2 API and embed that logic
> inside heap_insert_v2, then we can do this - pass the required
> information and the functions ExecInsertIndexTuples and
> ExecARInsertTriggers as callbacks so that, whether or not
> heap_insert_v2 choses single or multi inserts, it can callback these
> functions with the required information passed after the flush. We can
> add the callback and required information into TableInsertState. But,
> I'm not quite sure, we would make ExecInsertIndexTuples and
> ExecARInsertTriggers. And in
> 
> If we don't want to go with callback way, then at least we need to
> know whether or not heap_insert_v2 has chosen multi inserts, if yes,
> the buffered slots array, and the number of current buffered slots,
> whether they are flushed or not in the TableInsertState. Then,
> eventually, we might need all the HeapMultiInsertState info in the
> TableInsertState.
> 

To me the callback API seems cleaner, that on heap_insert_begin we can 
pass in a callback that is called on every flushed slot, or only on 
multi-insert flushes. Is there a reason it would only be done for 
multi-insert flushes or can it be generic?

> Thoughts?
> 
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
> 

Hi,

Replied inline.

Kind regards,
Luc