Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ss2z1-005iW7-9y for pgsql-general@arkaria.postgresql.org; Sat, 21 Sep 2024 16:29:20 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ss1zK-0055GI-Nf for pgsql-general@arkaria.postgresql.org; Sat, 21 Sep 2024 15:25:36 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ss1zK-0055FE-9N for pgsql-general@lists.postgresql.org; Sat, 21 Sep 2024 15:25:35 +0000 Received: from mail-yb1-xb2a.google.com ([2607:f8b0:4864:20::b2a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1ss1zD-000MW7-Bq for pgsql-general@lists.postgresql.org; Sat, 21 Sep 2024 15:25:34 +0000 Received: by mail-yb1-xb2a.google.com with SMTP id 3f1490d57ef6-e05f25fb96eso2748802276.1 for ; Sat, 21 Sep 2024 08:25:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1726932326; x=1727537126; darn=lists.postgresql.org; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date:message-id:reply-to; bh=qRFyJOdtiff2Mj2mrLe0wrTikvu7gRsA5OyN6Bk8OnI=; b=VB4j2Nm4h2ddsk+zppbidja8z1mcOMg1HoggUy45Kv0TNoKUL3YKEn2caR/qKRAUK+ fhxGYfP/+kNFc4XzYDhORN+caWcWHq/Zx/OTCnCy6P3Oaxe865L+7GKZki373w4DnmA0 lepitDiP5icmme15C/tgympf0g1Ec9Sm/C6Vk5GNiymw/HZ54MFpXey6EHatQksCef8G MpEEMYXXw0E6OXYHwIn3iPrU4s+BfdkYimchCP0O8qOfjWUiNBfQWyI2ecqYmwtkkscS OTQmxRieMkJ5qrhoUcYRVJ3S9C3JWDInk2HfCP28qHKHqwmnVLo7FOu6nbLKNVKvM70t T8dA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726932326; x=1727537126; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=qRFyJOdtiff2Mj2mrLe0wrTikvu7gRsA5OyN6Bk8OnI=; b=hhTkVOQwIQzXvTSFvkuFBPUxDOQq7GhTZXH5p2QsS/v6OHGXN/A+jS+trcA+5ILM6u QQGIRCDJBuKI4vwstw+zM8jOu3JPEcW6n6UEEsMYFUsTmXVBo3qRq6zyP65SEtZLgqAe mrq5mz9gozrhR1OwDqLk8lAzxrIe+B7PW88pghaNCnBNhPgxv/MG9KgaIL3B02Ue7XSJ cOyebJnAZ92quF8Fz+2RkjbCviWpu6RPqHXL+REii6lmzuLGe6U2DHPMpNDZjXPBHbtN aLdjMM/ZGFAsk98ktUIwRFHGh/2p+EufoagldkOc4R8eUPrIT42aPnFSBO76B2IX3sMZ KE/Q== X-Gm-Message-State: AOJu0YzYj2AZSPh1vLjOJ43OVPq+T3z9P8QbitYvUYJfWO80A8paD19u IXI9pBY1J0pmsk6qAAp1yqV18PtbY2RtrD09KdDErxk28c0cLauZeqFxZ5PejDJvwvL3chfS4dr q5ZGsmtW37DiF6/89xtAVhmzuTqk7xw== X-Google-Smtp-Source: AGHT+IHPZoBFwwjmURKboLnUQdDhXx47DNGute6XP5+Kvnr+znQgyh0DGM5yftDNRkc+zudGT3W0aX8xAFqNbZVxT3Q= X-Received: by 2002:a05:690c:f8f:b0:6d3:98b1:e3bc with SMTP id 00721157ae682-6dfeeeeae3amr62395987b3.32.1726932325721; Sat, 21 Sep 2024 08:25:25 -0700 (PDT) MIME-Version: 1.0 References: <4178E73A-24F5-4E3C-92F6-1532D8102C3E@kleczek.org> <20240921143629.t2x37xfczeeunpnf@hjp.at> In-Reply-To: <20240921143629.t2x37xfczeeunpnf@hjp.at> From: Lok P Date: Sat, 21 Sep 2024 20:55:13 +0530 Message-ID: Subject: Re: How batch processing works To: pgsql-general@lists.postgresql.org Content-Type: multipart/alternative; boundary="000000000000d36f850622a2c30f" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000d36f850622a2c30f Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sat, Sep 21, 2024 at 8:07=E2=80=AFPM Peter J. Holzer = wrote: > On 2024-09-21 16:44:08 +0530, Lok P wrote: > > But wondering why we don't see any difference in performance between > method-2 > > and method-3 above. > > The code runs completely inside the database. So there isn't much > difference between a single statement which inserts 50 rows and 50 > statements which insert 1 row each. The work to be done is (almost) the > same. > > This changes once you consider an application which runs outside of the > database (maybe even on a different host). Such an application has to > wait for the result of each statement before it can send the next one. > Now it makes a difference whether you are waiting 50 times for a > statement which does very little or just once for a statement which does > more work. > > > So does it mean that,I am testing this in a wrong way or > > That depends on what you want to test. If you are interested in the > behaviour of stored procedures, the test is correct. If you want to know > about the performance of a database client (whether its written in Java, > Python, Go or whatever), this is the wrong test. You have to write the > test in your target language and run it on the client system to get > realistic results (for example, the round-trip times will be a lot > shorter if the client and database are on the same computer than when > one is in Europe and the other in America). > > For example, here are the three methods as Python scripts: > > > -------------------------------------------------------------------------= -------------------------- > #!/usr/bin/python3 > > import time > import psycopg2 > > num_inserts =3D 10_000 > > db =3D psycopg2.connect() > csr =3D db.cursor() > > csr.execute("drop table if exists parent_table") > csr.execute("create table parent_table (id int primary key, t text)") > > start_time =3D time.monotonic() > for i in range(1, num_inserts+1): > csr.execute("insert into parent_table values(%s, %s)", (i, 'a')) > db.commit() > end_time =3D time.monotonic() > elapsed_time =3D end_time - start_time > print(f"Method 1: Individual Inserts with Commit after every Row: > {elapsed_time:.3} seconds") > > # vim: tw=3D99 > > -------------------------------------------------------------------------= -------------------------- > #!/usr/bin/python3 > > import time > import psycopg2 > > num_inserts =3D 10_000 > batch_size =3D 50 > > db =3D psycopg2.connect() > csr =3D db.cursor() > > csr.execute("drop table if exists parent_table") > csr.execute("create table parent_table (id int primary key, t text)") > db.commit() > > start_time =3D time.monotonic() > for i in range(1, num_inserts+1): > csr.execute("insert into parent_table values(%s, %s)", (i, 'a')) > if i % batch_size =3D=3D 0: > db.commit() > db.commit() > end_time =3D time.monotonic() > elapsed_time =3D end_time - start_time > print(f"Method 2: Individual Inserts with Commit after {batch_size} Rows= : > {elapsed_time:.3} seconds") > > # vim: tw=3D99 > > -------------------------------------------------------------------------= -------------------------- > #!/usr/bin/python3 > > import itertools > import time > import psycopg2 > > num_inserts =3D 10_000 > batch_size =3D 50 > > db =3D psycopg2.connect() > csr =3D db.cursor() > > csr.execute("drop table if exists parent_table") > csr.execute("create table parent_table (id int primary key, t text)") > db.commit() > > start_time =3D time.monotonic() > batch =3D [] > for i in range(1, num_inserts+1): > batch.append((i, 'a')) > if i % batch_size =3D=3D 0: > q =3D "insert into parent_table values" + ",".join(["(%s, %s)"] * > len(batch)) > params =3D list(itertools.chain.from_iterable(batch)) > csr.execute(q, params) > db.commit() > batch =3D [] > if batch: > q =3D "insert into parent_table values" + ",".join(["(%s, %s)"] * > len(batch)) > csr.execute(q, list(itertools.chain(batch))) > db.commit() > batch =3D [] > > end_time =3D time.monotonic() > elapsed_time =3D end_time - start_time > print(f"Method 3: Batch Inserts ({batch_size}) with Commit after each > batch: {elapsed_time:.3} seconds") > > # vim: tw=3D99 > > -------------------------------------------------------------------------= -------------------------- > > On my laptop, method2 is about twice as fast as method3. But if I > connect to a database on the other side of the city, method2 is now more > than 16 times faster than method3 . Simply because the delay in > communication is now large compared to the time it takes to insert those > rows. > > Thank you so much. I was expecting method-3(batch insert) to be the fastest or atleast as you said perform with similar speed as method-2 (row by row insert with batch commit) if we do it within the procedure inside the database. But because the context switching will be minimal in method-3 as it will prepare the insert and submit to the database in one shot in one DB call, so it should be a bit fast. But from your figures , it appears to be the opposite , i.e.method-2 is faster than method-3. Not able to understand the reason though. So in this case then ,it appears we can follow method-2 as that is cheaper in regards to less code change , i.e. just shifting the commit points without any changes for doing the batch insert. Btw,Do you have any thoughts, why method-2 is faster as compared to method-3 in your test? --000000000000d36f850622a2c30f Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
On Sat, Sep 21, 2024 at 8:07=E2=80=AF= PM Peter J. Holzer <hjp-pgsql@hjp.at= > wrote:
= On 2024-09-21 16:44:08 +0530, Lok P wrote:
> But wondering why we don't see any difference in performance=C2=A0= between method-2
> and method-3 above.

The code runs completely inside the database. So there isn't much
difference between a single statement which inserts 50 rows and 50
statements which insert 1 row each. The work to be done is (almost) the
same.

This changes once you consider an application which runs outside of the
database (maybe even on a different host). Such an application has to
wait for the result of each statement before it can send the next one.
Now it makes a difference whether you are waiting 50 times for a
statement which does very little or just once for a statement which does more work.

> So does it mean that,I am testing this in a wrong way or

That depends on what you want to test. If you are interested in the
behaviour of stored procedures, the test is correct. If you want to know about the performance of a database client (whether its written in Java, Python, Go or whatever), this is the wrong test. You have to write the
test in your target language and run it on the client system to get
realistic results (for example, the round-trip times will be a lot
shorter if the client and database are on the same computer than when
one is in Europe and the other in America).

For example, here are the three methods as Python scripts:

---------------------------------------------------------------------------= ------------------------
#!/usr/bin/python3

import time
import psycopg2

num_inserts =3D 10_000

db =3D psycopg2.connect()
csr =3D db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)&qu= ot;)

start_time =3D time.monotonic()
for i in range(1, num_inserts+1):
=C2=A0 =C2=A0 csr.execute("insert into parent_table values(%s, %s)&quo= t;, (i, 'a'))
=C2=A0 =C2=A0 db.commit()
end_time =3D time.monotonic()
elapsed_time =3D end_time - start_time
print(f"Method 1: Individual Inserts with Commit after every Row: {ela= psed_time:.3} seconds")

# vim: tw=3D99
---------------------------------------------------------------------------= ------------------------
#!/usr/bin/python3

import time
import psycopg2

num_inserts =3D 10_000
batch_size =3D 50

db =3D psycopg2.connect()
csr =3D db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)&qu= ot;)
db.commit()

start_time =3D time.monotonic()
for i in range(1, num_inserts+1):
=C2=A0 =C2=A0 csr.execute("insert into parent_table values(%s, %s)&quo= t;, (i, 'a'))
=C2=A0 =C2=A0 if i % batch_size =3D=3D 0:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 db.commit()
db.commit()
end_time =3D time.monotonic()
elapsed_time =3D end_time - start_time
print(f"Method 2: Individual Inserts with Commit after {batch_size}=C2= =A0 Rows: {elapsed_time:.3} seconds")

# vim: tw=3D99
---------------------------------------------------------------------------= ------------------------
#!/usr/bin/python3

import itertools
import time
import psycopg2

num_inserts =3D 10_000
batch_size =3D 50

db =3D psycopg2.connect()
csr =3D db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)&qu= ot;)
db.commit()

start_time =3D time.monotonic()
batch =3D []
for i in range(1, num_inserts+1):
=C2=A0 =C2=A0 batch.append((i, 'a'))
=C2=A0 =C2=A0 if i % batch_size =3D=3D 0:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 q =3D "insert into parent_table values&quo= t; + ",".join(["(%s, %s)"] * len(batch))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 params =3D list(itertools.chain.from_iterable(b= atch))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 csr.execute(q, params)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 db.commit()
=C2=A0 =C2=A0 =C2=A0 =C2=A0 batch =3D []
if batch:
=C2=A0 =C2=A0 q =3D "insert into parent_table values" + ",&q= uot;.join(["(%s, %s)"] * len(batch))
=C2=A0 =C2=A0 csr.execute(q, list(itertools.chain(batch)))
=C2=A0 =C2=A0 db.commit()
=C2=A0 =C2=A0 batch =3D []

end_time =3D time.monotonic()
elapsed_time =3D end_time - start_time
print(f"Method 3: Batch Inserts ({batch_size})=C2=A0 with Commit after= each batch: {elapsed_time:.3} seconds")

# vim: tw=3D99
---------------------------------------------------------------------------= ------------------------

On my laptop, method2 is about twice as fast as method3. But if I
connect to a database on the other side of the city, method2 is now more than 16 times faster than method3 . Simply because the delay in
communication is now large compared to the time it takes to insert those rows.


Thank you so much.
= I was expecting method-3(batch insert) to be the fastest or atleast as you = said perform with similar speed=C2=A0as method-2 (row by row insert with ba= tch commit) if we do it within the procedure inside the database. But becau= se the context switching will be minimal in method-3 as it will prepare the= insert and submit to the database in one shot in one DB call, so it should= be a bit fast. But from your figures , it appears to be the opposite , i.e= .method-2 is faster than method-3. Not able to understand the reason though= . So in this case then ,it appears we can follow method-2 as that is cheape= r in regards to less code change , i.e. just shifting the commit points wit= hout any changes for doing the batch insert.=C2=A0

Btw,Do you have any thoughts,=C2=A0 why method-2 is faster as compared to = method-3 in your test?
--000000000000d36f850622a2c30f--