Re: How batch processing works

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: How batch processing works
5+ messages / 2 participants
[nested] [flat]

* Re: How batch processing works
@ 2024-09-19 06:01  Ron Johnson <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Ron Johnson @ 2024-09-19 06:01 UTC (permalink / raw)
  To: Lok P <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, Sep 19, 2024 at 1:31 AM Lok P <[email protected]> wrote:

> Hello,
> Saw multiple threads around the same , so I want some clarification. As we
> know row by row is slow by slow processing , so in heavy write systems(say
> the client app is in Java) , people asked to do DMLS in batches rather in a
> row by row fashion to minimize the chatting or context switches between
> database and client which is resource intensive. What I understand is that
> , a true batch processing means the client has to collect all the input
> bind values and  prepare the insert statement and submit to the database at
> oneshot and then commit.
>
> What it means actually and if we divide the option as below, which method
> truly does batch processing or there exists some other method for doing the
> batch processing considering postgres as backend database?
>
> I understand, the first method below is truly a row by row processing in
> which context switches happen between client and database with each row,
> whereas the second method is just batching the commits but not a true batch
> processing as it will do the same amount of context switching between the
> database and client. But regarding the third and fourth method, will both
> execute in a similar fashion in the database with the same number of
> context switches? If any other better method exists to do these inserts in
> batches? Appreciate your guidance.
>
>
> CREATE TABLE parent_table (
>     id SERIAL PRIMARY KEY,
>     name TEXT
> );
>
> CREATE TABLE child_table (
>     id SERIAL PRIMARY KEY,
>     parent_id INT REFERENCES parent_table(id),
>     value TEXT
> );
>
>
> Method-1
>
> insert into parent_table values(1,'a');
> commit;
> insert into parent_table values(2,'a');
> commit;
> insert into child_table values(1,1,'a');
> Commit;
> insert into child_table values(1,2,'a');
> commit;
>
> VS
>
> Method-2
>
> insert into parent_table values(1,'a');
> insert into parent_table values(2,'a');
> insert into child_table values(1,1,'a');
> insert into child_table values(1,2,'a');
> Commit;
>

As a former "DP" programmer, from an application point of view, this is
absolutely batch programming.

My experience was with COBOL and C, though, which were low overhead.  From
what I've seen in PG log files, JDBC is astoundingly chatty.

[snip]

>
> Method-4
>
> INSERT INTO parent_table VALUES  (1, 'a'), (2, 'a');
> INSERT INTO child_table VALUES   (1,1, 'a'), (1,2, 'a');
> commit;
>

If I knew that I had to load a structured input data file (even if it had
parent and child records), this is how I'd do it (but probably first try
and see if "in-memory COPY INTO" is such a thing).

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> crustacean!


^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: How batch processing works
@ 2024-09-19 09:24  Lok P <[email protected]>
  parent: Ron Johnson <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Lok P @ 2024-09-19 09:24 UTC (permalink / raw)
  To: Ron Johnson <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, Sep 19, 2024 at 11:31 AM Ron Johnson <[email protected]>
wrote:

>
> [snip]
>
>>
>> Method-4
>>
>> INSERT INTO parent_table VALUES  (1, 'a'), (2, 'a');
>> INSERT INTO child_table VALUES   (1,1, 'a'), (1,2, 'a');
>> commit;
>>
>
> If I knew that I had to load a structured input data file (even if it had
> parent and child records), this is how I'd do it (but probably first try
> and see if "in-memory COPY INTO" is such a thing).
>
>

I was trying to reproduce this behaviour using row by row commit vs just
batch commit vs true batch insert as you mentioned, i am not able to see
any difference between "batch commit" and "true batch insert" response. Am
I missing anything?

CREATE TABLE debug_log (
    method1 TEXT,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    elapsed_time INTERVAL
);

CREATE TABLE parent_table (
    id SERIAL PRIMARY KEY,
    name TEXT
);


CREATE TABLE parent_table2 (
    id SERIAL PRIMARY KEY,
    name TEXT
);


CREATE TABLE parent_table3 (
    id SERIAL PRIMARY KEY,
    name TEXT
);
DO $$
DECLARE
    num_inserts INTEGER := 100000;
    batch_size INTEGER := 50;
    start_time TIMESTAMP;
    end_time TIMESTAMP;
    elapsed_time INTERVAL;
    i INTEGER;
BEGIN
    -- Method 1: Individual Inserts with Commit after every Row
    start_time := clock_timestamp();

    FOR i IN 1..num_inserts LOOP
        INSERT INTO parent_table VALUES (i, 'a');
        COMMIT;
    END LOOP;

    end_time := clock_timestamp();
    elapsed_time := end_time - start_time;
    INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
    VALUES ('Method 1: Individual Inserts with Commit after every Row',
start_time, end_time, elapsed_time);

    -- Method 2: Individual Inserts with Commit after 100 Rows
    start_time := clock_timestamp();

    FOR i IN 1..num_inserts LOOP
        INSERT INTO parent_table2 VALUES (i, 'a');
        -- Commit after every 100 rows
        IF i % batch_size = 0 THEN
            COMMIT;
        END IF;
    END LOOP;

    -- Final commit if not already committed
   commit;

    end_time := clock_timestamp();
    elapsed_time := end_time - start_time;
    INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
    VALUES ('Method 2: Individual Inserts with Commit after 100 Rows',
start_time, end_time, elapsed_time);

    -- Method 3: Batch Inserts with Commit after all
    start_time := clock_timestamp();

    FOR i IN 1..(num_inserts / batch_size) LOOP
        INSERT INTO parent_table3 VALUES
            (1 + (i - 1) * batch_size, 'a'),
            (2 + (i - 1) * batch_size, 'a'),
            (3 + (i - 1) * batch_size, 'a'),
            (4 + (i - 1) * batch_size, 'a'),
            (5 + (i - 1) * batch_size, 'a'),
            (6 + (i - 1) * batch_size, 'a'),
            (7 + (i - 1) * batch_size, 'a'),
            (8 + (i - 1) * batch_size, 'a'),
            (9 + (i - 1) * batch_size, 'a'),
            (10 + (i - 1) * batch_size, 'a'),
            (11 + (i - 1) * batch_size, 'a'),
            (12 + (i - 1) * batch_size, 'a'),
            (13 + (i - 1) * batch_size, 'a'),
            (14 + (i - 1) * batch_size, 'a'),
            (15 + (i - 1) * batch_size, 'a'),
            (16 + (i - 1) * batch_size, 'a'),
            (17 + (i - 1) * batch_size, 'a'),
            (18 + (i - 1) * batch_size, 'a'),
            (19 + (i - 1) * batch_size, 'a'),
            (20 + (i - 1) * batch_size, 'a'),
            (21 + (i - 1) * batch_size, 'a'),
            (22 + (i - 1) * batch_size, 'a'),
            (23 + (i - 1) * batch_size, 'a'),
            (24 + (i - 1) * batch_size, 'a'),
            (25 + (i - 1) * batch_size, 'a'),
            (26 + (i - 1) * batch_size, 'a'),
            (27 + (i - 1) * batch_size, 'a'),
            (28 + (i - 1) * batch_size, 'a'),
            (29 + (i - 1) * batch_size, 'a'),
            (30 + (i - 1) * batch_size, 'a'),
            (31 + (i - 1) * batch_size, 'a'),
            (32 + (i - 1) * batch_size, 'a'),
            (33 + (i - 1) * batch_size, 'a'),
            (34 + (i - 1) * batch_size, 'a'),
            (35 + (i - 1) * batch_size, 'a'),
            (36 + (i - 1) * batch_size, 'a'),
            (37 + (i - 1) * batch_size, 'a'),
            (38 + (i - 1) * batch_size, 'a'),
            (39 + (i - 1) * batch_size, 'a'),
            (40 + (i - 1) * batch_size, 'a'),
            (41 + (i - 1) * batch_size, 'a'),
            (42 + (i - 1) * batch_size, 'a'),
            (43 + (i - 1) * batch_size, 'a'),
            (44 + (i - 1) * batch_size, 'a'),
            (45 + (i - 1) * batch_size, 'a'),
            (46 + (i - 1) * batch_size, 'a'),
            (47 + (i - 1) * batch_size, 'a'),
            (48 + (i - 1) * batch_size, 'a'),
            (49 + (i - 1) * batch_size, 'a'),
            (50 + (i - 1) * batch_size, 'a'));
COMMIT;
    END LOOP;

    COMMIT;  -- Final commit for all
    end_time := clock_timestamp();
    elapsed_time := end_time - start_time;
    INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
    VALUES ('Method 3: Batch Inserts with Commit after All', start_time,
end_time, elapsed_time);

END $$;


^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: How batch processing works
@ 2024-09-19 12:07  Ron Johnson <[email protected]>
  parent: Lok P <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Ron Johnson @ 2024-09-19 12:07 UTC (permalink / raw)
  To: pgsql-general <[email protected]>

On Thu, Sep 19, 2024 at 5:24 AM Lok P <[email protected]> wrote:

>
> On Thu, Sep 19, 2024 at 11:31 AM Ron Johnson <[email protected]>
> wrote:
>
>>
>> [snip]
>>
>>>
>>> Method-4
>>>
>>> INSERT INTO parent_table VALUES  (1, 'a'), (2, 'a');
>>> INSERT INTO child_table VALUES   (1,1, 'a'), (1,2, 'a');
>>> commit;
>>>
>>
>> If I knew that I had to load a structured input data file (even if it had
>> parent and child records), this is how I'd do it (but probably first try
>> and see if "in-memory COPY INTO" is such a thing).
>>
>>
>
> I was trying to reproduce this behaviour using row by row commit vs just
> batch commit vs true batch insert as you mentioned, i am not able to see
> any difference between "batch commit" and "true batch insert" response. Am
> I missing anything?
>
> [snip]

> DO $$
> DECLARE
>     num_inserts INTEGER := 100000;
>     batch_size INTEGER := 50;
>     start_time TIMESTAMP;
>     end_time TIMESTAMP;
>     elapsed_time INTERVAL;
>     i INTEGER;
> BEGIN
>     -- Method 1: Individual Inserts with Commit after every Row
>     start_time := clock_timestamp();
>
>     FOR i IN 1..num_inserts LOOP
>         INSERT INTO parent_table VALUES (i, 'a');
>         COMMIT;
>     END LOOP;
>
>     end_time := clock_timestamp();
>     elapsed_time := end_time - start_time;
>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>     VALUES ('Method 1: Individual Inserts with Commit after every Row',
> start_time, end_time, elapsed_time);
>
>     -- Method 2: Individual Inserts with Commit after 100 Rows
>     start_time := clock_timestamp();
>
>     FOR i IN 1..num_inserts LOOP
>         INSERT INTO parent_table2 VALUES (i, 'a');
>         -- Commit after every 100 rows
>         IF i % batch_size = 0 THEN
>             COMMIT;
>         END IF;
>     END LOOP;
>
>     -- Final commit if not already committed
>    commit;
>
>     end_time := clock_timestamp();
>     elapsed_time := end_time - start_time;
>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>     VALUES ('Method 2: Individual Inserts with Commit after 100 Rows',
> start_time, end_time, elapsed_time);
>
>     -- Method 3: Batch Inserts with Commit after all
>     start_time := clock_timestamp();
>
>     FOR i IN 1..(num_inserts / batch_size) LOOP
>         INSERT INTO parent_table3 VALUES
>             (1 + (i - 1) * batch_size, 'a'),
>
[snip]

>             (49 + (i - 1) * batch_size, 'a'),
>             (50 + (i - 1) * batch_size, 'a'));
> COMMIT;
>     END LOOP;
>
>     COMMIT;  -- Final commit for all
>     end_time := clock_timestamp();
>     elapsed_time := end_time - start_time;
>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>     VALUES ('Method 3: Batch Inserts with Commit after All', start_time,
> end_time, elapsed_time);
>
> END $$;
>

Reproduce what behavior?

Anyway, plpgsql functions (including anonymous DO statements) are -- to
Postgresql -- single statements.  Thus, they'll be faster than
individual calls..

An untrusted language like plpython3u might speed things up even more, if
you have to read a heterogeneous external file and insert all the records
into the db.

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> crustacean!


^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: How batch processing works
@ 2024-09-19 13:12  Lok P <[email protected]>
  parent: Ron Johnson <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Lok P @ 2024-09-19 13:12 UTC (permalink / raw)
  To: Ron Johnson <[email protected]>; +Cc: pgsql-general <[email protected]>

On Thu, Sep 19, 2024 at 5:40 PM Ron Johnson <[email protected]> wrote:

> On Thu, Sep 19, 2024 at 5:24 AM Lok P <[email protected]> wrote:
>
>>
>>
>> [snip]
>
>> DO $$
>> DECLARE
>>     num_inserts INTEGER := 100000;
>>     batch_size INTEGER := 50;
>>     start_time TIMESTAMP;
>>     end_time TIMESTAMP;
>>     elapsed_time INTERVAL;
>>     i INTEGER;
>> BEGIN
>>     -- Method 1: Individual Inserts with Commit after every Row
>>     start_time := clock_timestamp();
>>
>>     FOR i IN 1..num_inserts LOOP
>>         INSERT INTO parent_table VALUES (i, 'a');
>>         COMMIT;
>>     END LOOP;
>>
>>     end_time := clock_timestamp();
>>     elapsed_time := end_time - start_time;
>>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>>     VALUES ('Method 1: Individual Inserts with Commit after every Row',
>> start_time, end_time, elapsed_time);
>>
>>     -- Method 2: Individual Inserts with Commit after 100 Rows
>>     start_time := clock_timestamp();
>>
>>     FOR i IN 1..num_inserts LOOP
>>         INSERT INTO parent_table2 VALUES (i, 'a');
>>         -- Commit after every 100 rows
>>         IF i % batch_size = 0 THEN
>>             COMMIT;
>>         END IF;
>>     END LOOP;
>>
>>     -- Final commit if not already committed
>>    commit;
>>
>>     end_time := clock_timestamp();
>>     elapsed_time := end_time - start_time;
>>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>>     VALUES ('Method 2: Individual Inserts with Commit after 100 Rows',
>> start_time, end_time, elapsed_time);
>>
>>     -- Method 3: Batch Inserts with Commit after all
>>     start_time := clock_timestamp();
>>
>>     FOR i IN 1..(num_inserts / batch_size) LOOP
>>         INSERT INTO parent_table3 VALUES
>>             (1 + (i - 1) * batch_size, 'a'),
>>
> [snip]
>
>>             (49 + (i - 1) * batch_size, 'a'),
>>             (50 + (i - 1) * batch_size, 'a'));
>> COMMIT;
>>     END LOOP;
>>
>>     COMMIT;  -- Final commit for all
>>     end_time := clock_timestamp();
>>     elapsed_time := end_time - start_time;
>>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>>     VALUES ('Method 3: Batch Inserts with Commit after All', start_time,
>> end_time, elapsed_time);
>>
>> END $$;
>>
>
> Reproduce what behavior?
>
> Anyway, plpgsql functions (including anonymous DO statements) are -- to
> Postgresql -- single statements.  Thus, they'll be faster than
> individual calls..
>
> An untrusted language like plpython3u might speed things up even more, if
> you have to read a heterogeneous external file and insert all the records
> into the db.
>

Here if you see my script , the method-1 is doing commit after each row
insert. And method-2 is doing a batch commit i.e. commit after every "50"
row. And method-3 is doing a true batch insert i.e. combining all the 50
values in one insert statement and submitting to the database in oneshot
and then COMMIT it, so the context switching will be a lot less. So I was
expecting Method-3 to be the fastest way to insert the rows here, but the
response time shows the same response time for Method-2 and method-3.
Method-1 is the slowest through.


^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: How batch processing works
@ 2024-09-20 05:02  Lok P <[email protected]>
  parent: Lok P <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Lok P @ 2024-09-20 05:02 UTC (permalink / raw)
  To: Ron Johnson <[email protected]>; +Cc: pgsql-general <[email protected]>

Below are the results for the posted methods. Tested it on local and it
gave no difference in timing between the method-2 andmethod-3. Failed to
run in dbfiddle somehow.

Also I was initially worried if adding the trigger to the our target table,
will worsen the performance as because , it will make all the execution to
"row by row" rather a true batch insert(method-3 as posted) as there will
be more number of context switches , but it seems it will still be doing
the batch commits(like the way its in method-2). So as per that , we won't
lose any performance as such. Is this understanding correct?


*Method-1- 00:01:44.48*

*Method-2- 00:00:02.67*

*Method-3- 00:00:02.39*

https://gist.github.com/databasetech0073/8e9106757d751358c0c0c65a2374dbc6

On Thu, Sep 19, 2024 at 6:42 PM Lok P <[email protected]> wrote:

>
>
> On Thu, Sep 19, 2024 at 5:40 PM Ron Johnson <[email protected]>
> wrote:
>
>> On Thu, Sep 19, 2024 at 5:24 AM Lok P <[email protected]> wrote:
>>
>>>
>>>
>>> [snip]
>>
>>> DO $$
>>> DECLARE
>>>     num_inserts INTEGER := 100000;
>>>     batch_size INTEGER := 50;
>>>     start_time TIMESTAMP;
>>>     end_time TIMESTAMP;
>>>     elapsed_time INTERVAL;
>>>     i INTEGER;
>>> BEGIN
>>>     -- Method 1: Individual Inserts with Commit after every Row
>>>     start_time := clock_timestamp();
>>>
>>>     FOR i IN 1..num_inserts LOOP
>>>         INSERT INTO parent_table VALUES (i, 'a');
>>>         COMMIT;
>>>     END LOOP;
>>>
>>>     end_time := clock_timestamp();
>>>     elapsed_time := end_time - start_time;
>>>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>>>     VALUES ('Method 1: Individual Inserts with Commit after every Row',
>>> start_time, end_time, elapsed_time);
>>>
>>>     -- Method 2: Individual Inserts with Commit after 100 Rows
>>>     start_time := clock_timestamp();
>>>
>>>     FOR i IN 1..num_inserts LOOP
>>>         INSERT INTO parent_table2 VALUES (i, 'a');
>>>         -- Commit after every 100 rows
>>>         IF i % batch_size = 0 THEN
>>>             COMMIT;
>>>         END IF;
>>>     END LOOP;
>>>
>>>     -- Final commit if not already committed
>>>    commit;
>>>
>>>     end_time := clock_timestamp();
>>>     elapsed_time := end_time - start_time;
>>>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>>>     VALUES ('Method 2: Individual Inserts with Commit after 100 Rows',
>>> start_time, end_time, elapsed_time);
>>>
>>>     -- Method 3: Batch Inserts with Commit after all
>>>     start_time := clock_timestamp();
>>>
>>>     FOR i IN 1..(num_inserts / batch_size) LOOP
>>>         INSERT INTO parent_table3 VALUES
>>>             (1 + (i - 1) * batch_size, 'a'),
>>>
>> [snip]
>>
>>>             (49 + (i - 1) * batch_size, 'a'),
>>>             (50 + (i - 1) * batch_size, 'a'));
>>> COMMIT;
>>>     END LOOP;
>>>
>>>     COMMIT;  -- Final commit for all
>>>     end_time := clock_timestamp();
>>>     elapsed_time := end_time - start_time;
>>>     INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
>>>     VALUES ('Method 3: Batch Inserts with Commit after All', start_time,
>>> end_time, elapsed_time);
>>>
>>> END $$;
>>>
>>
>> Reproduce what behavior?
>>
>> Anyway, plpgsql functions (including anonymous DO statements) are -- to
>> Postgresql -- single statements.  Thus, they'll be faster than
>> individual calls..
>>
>> An untrusted language like plpython3u might speed things up even more, if
>> you have to read a heterogeneous external file and insert all the records
>> into the db.
>>
>
> Here if you see my script , the method-1 is doing commit after each row
> insert. And method-2 is doing a batch commit i.e. commit after every "50"
> row. And method-3 is doing a true batch insert i.e. combining all the 50
> values in one insert statement and submitting to the database in oneshot
> and then COMMIT it, so the context switching will be a lot less. So I was
> expecting Method-3 to be the fastest way to insert the rows here, but the
> response time shows the same response time for Method-2 and method-3.
> Method-1 is the slowest through.
>


^ permalink  raw  reply  [nested|flat] 5+ messages in thread

end of thread, other threads:[~2024-09-20 05:02 UTC | newest]

Thread overview: 5+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-09-19 06:01 Re: How batch processing works Ron Johnson <[email protected]>
2024-09-19 09:24 ` Lok P <[email protected]>
2024-09-19 12:07   ` Ron Johnson <[email protected]>
2024-09-19 13:12     ` Lok P <[email protected]>
2024-09-20 05:02       ` Lok P <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox