Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ssF2I-0075qv-LA for pgsql-general@arkaria.postgresql.org; Sun, 22 Sep 2024 05:21:31 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ssF2G-00GtqI-A8 for pgsql-general@arkaria.postgresql.org; Sun, 22 Sep 2024 05:21:29 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ssF2F-00GtqA-OL for pgsql-general@lists.postgresql.org; Sun, 22 Sep 2024 05:21:29 +0000 Received: from mail-yw1-x1135.google.com ([2607:f8b0:4864:20::1135]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1ssF29-000R9Y-UR for pgsql-general@lists.postgresql.org; Sun, 22 Sep 2024 05:21:27 +0000 Received: by mail-yw1-x1135.google.com with SMTP id 00721157ae682-6d9f65f9e3eso28034887b3.3 for ; Sat, 21 Sep 2024 22:21:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1726982480; x=1727587280; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=dQs/vRB6Y0GRDyRsqfPaCcDOLj89vXlyax9vm7iYnJ0=; b=GCTJJw80cahzprKSXSvRHKUH0Xd0PUIdiutSmhT9HFwwqr/Jp8RrWpyCj3z0SIx43c SW522rV6qS8R0cEaDEfKCDw1+b92Nxvflon1xUHtyc1Y4vjrL6ybrcvRBg2eURgjoXp0 E+bRk0ivIltLL8c1j567DGd+UObAmNVW3sxokuOo1ZP0PGTTLpnsInS2sUOmyuVIX2lP DIgZRs3lHvutO80MUt5Xu955pU17kd0/thb9p90+RUU9zNh2ta5yL28nzem1jacVOYLs t2WzA6qXMxSIosNRb8A9jCzJ8SBTpbEIwObx1pGhoqlb3xaLZNyOEjiEjupA3CzhFO4v ZoTA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726982480; x=1727587280; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=dQs/vRB6Y0GRDyRsqfPaCcDOLj89vXlyax9vm7iYnJ0=; b=ZXHPKsDjSWpIElrEHMqGaPRcePTKYxJbFod24+PPzF7izfTsIyLJIooiweZ/W2dgTG wn8uhz5sl0Hb/juPDlW/j+GYWxVQP+OGmIc4tWU0mGA06axI2BlAE0ouuarLlmz6aQQT CNeaLqqUaOghJt6Flpm/NutYR04SmhDcykfJh8oU80N+krYVe+QMgzGG6lSsdgglYw6E QnAdgRlVN8RLxfo43cn6FPIsUzQTRVnxSZtb2u2ZeWcNrzVHKCcvqGW1BKQz/7pqcQVA rbopxTkXDALPxIjnatr89a93c+c9FJml+TvpDaunRnueTtqm2AMjlE05AbkZa+hPrLDq 8+GQ== X-Gm-Message-State: AOJu0YyvjNe/0M/Tbl3plijp7T65wGJpM1aFcaarIQWYr3f5nVrYqvXI d8wr7SOmgqtV9B7D3jdf7OC7BgshQNHjJEgZEBbkTu8zHAMTaANYyzEk7EPAWsGjmEIMFOSAZ0m MYL+/y3uNgMfChOxB4AY371/Gdec= X-Google-Smtp-Source: AGHT+IGMlkScTk6ffRusW3O0kdG8rK4OsJKLgc6Umturt30orb/AcNrhcI8tir5VTPOaDeTMC2LfnIs/h7q4LaqxF78= X-Received: by 2002:a05:690c:4b88:b0:6dd:b920:1cb5 with SMTP id 00721157ae682-6dfeed5a4ccmr67661777b3.25.1726982480328; Sat, 21 Sep 2024 22:21:20 -0700 (PDT) MIME-Version: 1.0 References: <4178E73A-24F5-4E3C-92F6-1532D8102C3E@kleczek.org> <20240921143629.t2x37xfczeeunpnf@hjp.at> <2fe0be15-f64e-465c-9dd2-b55c559ac7e2@aklaver.com> In-Reply-To: <2fe0be15-f64e-465c-9dd2-b55c559ac7e2@aklaver.com> From: Lok P Date: Sun, 22 Sep 2024 10:51:07 +0530 Message-ID: Subject: Re: How batch processing works To: Adrian Klaver Cc: pgsql-general@lists.postgresql.org Content-Type: multipart/alternative; boundary="0000000000004603790622ae711b" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000004603790622ae711b Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sun, Sep 22, 2024 at 12:46=E2=80=AFAM Adrian Klaver wrote: > On 9/21/24 07:36, Peter J. Holzer wrote: > > On 2024-09-21 16:44:08 +0530, Lok P wrote: > > > > -------------------------------------------------------------------------= -------------------------- > > #!/usr/bin/python3 > > > > import time > > import psycopg2 > > > > num_inserts =3D 10_000 > > batch_size =3D 50 > > > > db =3D psycopg2.connect() > > csr =3D db.cursor() > > > > csr.execute("drop table if exists parent_table") > > csr.execute("create table parent_table (id int primary key, t text)") > > db.commit() > > > > start_time =3D time.monotonic() > > for i in range(1, num_inserts+1): > > csr.execute("insert into parent_table values(%s, %s)", (i, 'a')) > > if i % batch_size =3D=3D 0: > > db.commit() > > db.commit() > > end_time =3D time.monotonic() > > elapsed_time =3D end_time - start_time > > print(f"Method 2: Individual Inserts with Commit after {batch_size} > Rows: {elapsed_time:.3} seconds") > > > > # vim: tw=3D99 > > > -------------------------------------------------------------------------= -------------------------- > > FYI, this is less of problem with psycopg(3) and pipeline mode: > > import time > import psycopg > > num_inserts =3D 10_000 > batch_size =3D 50 > > db =3D psycopg.connect("dbname=3Dtest user=3Dpostgres host=3D104.237.158.= 68") > csr =3D db.cursor() > > csr.execute("drop table if exists parent_table") > csr.execute("create table parent_table (id int primary key, t text)") > db.commit() > > start_time =3D time.monotonic() > with db.pipeline(): > for i in range(1, num_inserts+1): > csr.execute("insert into parent_table values(%s, %s)", (i, 'a')) > if i % batch_size =3D=3D 0: > db.commit() > db.commit() > end_time =3D time.monotonic() > elapsed_time =3D end_time - start_time > print(f"Method 2: Individual Inserts(psycopg3 pipeline mode) with Commit > after {batch_size} Rows: {elapsed_time:.3} seconds") > > > For remote to a database in another state that took the time from: > > Method 2: Individual Inserts with Commit after 50 Rows: 2.42e+02 seconds > > to: > > Method 2: Individual Inserts(psycopg3 pipeline mode) with Commit after > 50 Rows: 9.83 seconds > > > #!/usr/bin/python3 > > > > import itertools > > import time > > import psycopg2 > > > > num_inserts =3D 10_000 > > batch_size =3D 50 > > > > db =3D psycopg2.connect() > > csr =3D db.cursor() > > > > csr.execute("drop table if exists parent_table") > > csr.execute("create table parent_table (id int primary key, t text)") > > db.commit() > > > > start_time =3D time.monotonic() > > batch =3D [] > > for i in range(1, num_inserts+1): > > batch.append((i, 'a')) > > if i % batch_size =3D=3D 0: > > q =3D "insert into parent_table values" + ",".join(["(%s, %s)"= ] * > len(batch)) > > params =3D list(itertools.chain.from_iterable(batch)) > > csr.execute(q, params) > > db.commit() > > batch =3D [] > > if batch: > > q =3D "insert into parent_table values" + ",".join(["(%s, %s)"] * > len(batch)) > > csr.execute(q, list(itertools.chain(batch))) > > db.commit() > > batch =3D [] > > > > end_time =3D time.monotonic() > > elapsed_time =3D end_time - start_time > > print(f"Method 3: Batch Inserts ({batch_size}) with Commit after each > batch: {elapsed_time:.3} seconds") > > > > # vim: tw=3D99 > > > -------------------------------------------------------------------------= -------------------------- > > The above can also be handled with execute_batch() and execute_values() > from: > > https://www.psycopg.org/docs/extras.html#fast-execution-helpers > > > > > On my laptop, method2 is about twice as fast as method3. But if I > > connect to a database on the other side of the city, method2 is now mor= e > > than 16 times faster than method3 . Simply because the delay in > > communication is now large compared to the time it takes to insert thos= e > > rows. > > > > > > Thank you. So if I get it correct, if the client app(from which the data is getting streamed/inserted) is in the same data center/zone as the database (which is most of the time the case) then the batch insert does not appear to be much beneficial. Which also means , people here were afraid of having triggers in such a high dml table as because this will make the "batch insert" automatically converted into "row by row" behind the scene, but considering the above results, it looks fine to go with a row by row approach (but just having batch commit in place in place of row by row commit). And not to worry about implementing the true batch insert approach as that is not making a big difference here in data load performance. --0000000000004603790622ae711b Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
On Sun, Sep 22, 2024 at 12:46=E2=80= =AFAM Adrian Klaver <adrian= .klaver@aklaver.com> wrote:
On 9/21/24 07:36, Peter J. Holzer wrote:
> On 2024-09-21 16:44:08 +0530, Lok P wrote:

> ----------------------------------------------------------------------= -----------------------------
> #!/usr/bin/python3
>
> import time
> import psycopg2
>
> num_inserts =3D 10_000
> batch_size =3D 50
>
> db =3D psycopg2.connect()
> csr =3D db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t tex= t)")
> db.commit()
>
> start_time =3D time.monotonic()
> for i in range(1, num_inserts+1):
>=C2=A0 =C2=A0 =C2=A0 csr.execute("insert into parent_table values(= %s, %s)", (i, 'a'))
>=C2=A0 =C2=A0 =C2=A0 if i % batch_size =3D=3D 0:
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 db.commit()
> db.commit()
> end_time =3D time.monotonic()
> elapsed_time =3D end_time - start_time
> print(f"Method 2: Individual Inserts with Commit after {batch_siz= e}=C2=A0 Rows: {elapsed_time:.3} seconds")
>
> # vim: tw=3D99
> ----------------------------------------------------------------------= -----------------------------

FYI, this is less of problem with psycopg(3) and pipeline mode:

import time
import psycopg

num_inserts =3D 10_000
batch_size =3D 50

db =3D psycopg.connect("dbname=3Dtest user=3Dpostgres host=3D104.237.1= 58.68")
csr =3D db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)&qu= ot;)
db.commit()

start_time =3D time.monotonic()
with db.pipeline():
=C2=A0 =C2=A0 =C2=A0for i in range(1, num_inserts+1):
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0csr.execute("insert into parent_tabl= e values(%s, %s)", (i, 'a'))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if i % batch_size =3D=3D 0:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0db.commit()
db.commit()
end_time =3D time.monotonic()
elapsed_time =3D end_time - start_time
print(f"Method 2: Individual Inserts(psycopg3 pipeline mode) with Comm= it
after {batch_size}=C2=A0 Rows: {elapsed_time:.3} seconds")


For remote to a database in another state that took the=C2=A0 time from:
Method 2: Individual Inserts with Commit after 50=C2=A0 Rows: 2.42e+02 seco= nds

to:

Method 2: Individual Inserts(psycopg3 pipeline mode) with Commit after
50=C2=A0 Rows: 9.83 seconds

> #!/usr/bin/python3
>
> import itertools
> import time
> import psycopg2
>
> num_inserts =3D 10_000
> batch_size =3D 50
>
> db =3D psycopg2.connect()
> csr =3D db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t tex= t)")
> db.commit()
>
> start_time =3D time.monotonic()
> batch =3D []
> for i in range(1, num_inserts+1):
>=C2=A0 =C2=A0 =C2=A0 batch.append((i, 'a'))
>=C2=A0 =C2=A0 =C2=A0 if i % batch_size =3D=3D 0:
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 q =3D "insert into parent_table= values" + ",".join(["(%s, %s)"] * len(batch))
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 params =3D list(itertools.chain.from= _iterable(batch))
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 csr.execute(q, params)
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 db.commit()
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 batch =3D []
> if batch:
>=C2=A0 =C2=A0 =C2=A0 q =3D "insert into parent_table values" = + ",".join(["(%s, %s)"] * len(batch))
>=C2=A0 =C2=A0 =C2=A0 csr.execute(q, list(itertools.chain(batch)))
>=C2=A0 =C2=A0 =C2=A0 db.commit()
>=C2=A0 =C2=A0 =C2=A0 batch =3D []
>
> end_time =3D time.monotonic()
> elapsed_time =3D end_time - start_time
> print(f"Method 3: Batch Inserts ({batch_size})=C2=A0 with Commit = after each batch: {elapsed_time:.3} seconds")
>
> # vim: tw=3D99
> ----------------------------------------------------------------------= -----------------------------

The above can also be handled with execute_batch() and execute_values() from:

https://www.psycopg.org/docs/extras.h= tml#fast-execution-helpers

>
> On my laptop, method2 is about twice as fast as method3. But if I
> connect to a database on the other side of the city, method2 is now mo= re
> than 16 times faster than method3 . Simply because the delay in
> communication is now large compared to the time it takes to insert tho= se
> rows.
>




Thank you. So if I get it correct, if = the client app(from which the data is getting streamed/inserted) is in the = same data center/zone as the database (which is most of the time the case) = then the batch insert does not appear to be much beneficial.=C2=A0

Which also means , people here were afraid of having trigg= ers in such a high dml table as because this will make the "batch inse= rt" automatically=C2=A0 converted into "row by row" behind t= he scene, but considering the above=C2=A0results, it looks fine to go with = a row by row approach (but just having batch commit in place in place of ro= w by row commit). And not to worry about implementing the true batch insert= approach as that is not making=C2=A0a big difference here in data load per= formance.
--0000000000004603790622ae711b--