MIME-Version: 1.0
References: <CAKna9VY2v0XsDberzbJXZ4MqEW1RUtD0L_Mis_vrgEQWZgH0gg@mail.gmail.com>
 <4178E73A-24F5-4E3C-92F6-1532D8102C3E@kleczek.org> <CAKna9Vbt1VJu7Oa8FTWasgby+-kJn7omOhbfmWzkdpVwBiqNzQ@mail.gmail.com>
 <20240921143629.t2x37xfczeeunpnf@hjp.at>
In-Reply-To: <20240921143629.t2x37xfczeeunpnf@hjp.at>
From: Lok P <loknath.73@gmail.com>
Date: Sat, 21 Sep 2024 20:55:13 +0530
Message-ID: <CAKna9VZA5x1g7-4j8cQrW1ByqX5jtPwY7sY2MHuRgrxrOu4LBg@mail.gmail.com>
Subject: Re: How batch processing works
To: pgsql-general@lists.postgresql.org
Content-Type: multipart/alternative; boundary="000000000000d36f850622a2c30f"
Archived-At: <https://www.postgresql.org/message-id/CAKna9VZA5x1g7-4j8cQrW1ByqX5jtPwY7sY2MHuRgrxrOu4LBg%40mail.gmail.com>
Precedence: bulk

--000000000000d36f850622a2c30f
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, Sep 21, 2024 at 8:07 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

> On 2024-09-21 16:44:08 +0530, Lok P wrote:
> > But wondering why we don't see any difference in performance between
> > method-2 and method-3 above.
>
> The code runs completely inside the database. So there isn't much
> difference between a single statement which inserts 50 rows and 50
> statements which insert 1 row each. The work to be done is (almost) the
> same.
>
> This changes once you consider an application which runs outside of the
> database (maybe even on a different host). Such an application has to
> wait for the result of each statement before it can send the next one.
> Now it makes a difference whether you are waiting 50 times for a
> statement which does very little or just once for a statement which does
> more work.
>
> > So does it mean that I am testing this in a wrong way or
>
> That depends on what you want to test. If you are interested in the
> behaviour of stored procedures, the test is correct. If you want to know
> about the performance of a database client (whether it's written in Java,
> Python, Go or whatever), this is the wrong test. You have to write the
> test in your target language and run it on the client system to get
> realistic results (for example, the round-trip times will be a lot
> shorter if the client and database are on the same computer than when
> one is in Europe and the other in America).
>
> For example, here are the three methods as Python scripts:
>
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import time
> import psycopg2
>
> num_inserts = 10_000
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
>
> start_time = time.monotonic()
> for i in range(1, num_inserts+1):
>     csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
>     db.commit()
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 1: Individual Inserts with Commit after every Row: {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import time
> import psycopg2
>
> num_inserts = 10_000
> batch_size = 50
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
> db.commit()
>
> start_time = time.monotonic()
> for i in range(1, num_inserts+1):
>     csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
>     if i % batch_size == 0:
>         db.commit()
> db.commit()
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 2: Individual Inserts with Commit after {batch_size}  Rows: {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import itertools
> import time
> import psycopg2
>
> num_inserts = 10_000
> batch_size = 50
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
> db.commit()
>
> start_time = time.monotonic()
> batch = []
> for i in range(1, num_inserts+1):
>     batch.append((i, 'a'))
>     if i % batch_size == 0:
>         q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
>         params = list(itertools.chain.from_iterable(batch))
>         csr.execute(q, params)
>         db.commit()
>         batch = []
> if batch:
>     # Flush the final partial batch; the row tuples must be flattened here too.
>     q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
>     csr.execute(q, list(itertools.chain.from_iterable(batch)))
>     db.commit()
>     batch = []
>
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 3: Batch Inserts ({batch_size})  with Commit after each batch: {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
>
> On my laptop, method2 is about twice as fast as method3. But if I
> connect to a database on the other side of the city, method2 is now more
> than 16 times faster than method3. Simply because the delay in
> communication is now large compared to the time it takes to insert those
> rows.
>
>
Thank you so much.
I was expecting method-3 (batch insert) to be the fastest, or at least, as
you said, to perform at a similar speed to method-2 (row-by-row insert with
batch commit) when run within a procedure inside the database. Because
method-3 prepares the insert and submits it to the database in one shot, in
a single DB call, context switching should be minimal, so I expected it to
be a bit faster. But your figures appear to show the opposite, i.e. method-2
is faster than method-3. I am not able to understand the reason, though. In
that case it appears we can follow method-2, as it needs less code change:
we just shift the commit points, without any changes for doing the batch
insert.
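
As a rough sketch of that expectation (not a measurement): if every
statement and every commit costs one client/server round trip, the three
methods can be compared simply by counting round trips. The function names,
the 1 ms round-trip time and the per-row cost below are assumptions made up
for illustration, and the model ignores any extra traffic the driver may
add.

```python
import math


def round_trips(num_rows, batch_size, multi_row):
    """Count client/server round trips: one per statement plus one per commit."""
    commits = math.ceil(num_rows / batch_size)
    # A multi-row insert sends one statement per batch; otherwise one per row.
    statements = commits if multi_row else num_rows
    return statements + commits


def estimated_seconds(num_rows, batch_size, multi_row, round_trip_s, per_row_s):
    """Network wait plus server-side work, under the simple model above."""
    trips = round_trips(num_rows, batch_size, multi_row)
    return trips * round_trip_s + num_rows * per_row_s


if __name__ == "__main__":
    num_inserts = 10_000
    # (batch_size, multi_row) for each method discussed in this thread.
    methods = {
        "method-1 (commit per row)": (1, False),
        "method-2 (row inserts, commit per 50)": (50, False),
        "method-3 (50-row insert, commit per batch)": (50, True),
    }
    for name, (batch_size, multi_row) in methods.items():
        trips = round_trips(num_inserts, batch_size, multi_row)
        # Hypothetical figures: 1 ms per round trip, 10 microseconds per row.
        secs = estimated_seconds(num_inserts, batch_size, multi_row, 0.001, 0.00001)
        print(f"{name}: {trips} round trips, about {secs:.2f} s")
```

Under these assumptions method-3 needs 400 round trips against 10,200 for
method-2, which is why I expected it to be the faster one from a remote
client.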

Btw, do you have any thoughts on why method-2 is faster than method-3 in
your test?

--000000000000d36f850622a2c30f--