MIME-Version: 1.0
References: <cb014f00-66b2-4328-a65e-d11c681c9f45@gmail.com>
 <CAEvyyTj0rEsgcQOQgkARbRPbupHR_mc=TUzHBBLNzd8JByUUTw@mail.gmail.com>
 <4c1d0b97-a5f8-472c-afdd-bdeb09b93f33@gmail.com>
 <OS9PR01MB12149A6CEE200E1A5A88D9F14F563A@OS9PR01MB12149.jpnprd01.prod.outlook.com>
 <CAEvyyTiQqd=rv3XUxc0YEaW-feopksBveZKKjVZNeSVG=GrY+A@mail.gmail.com>
 <TYRPR01MB121560B291DA3CD262CC7A09AF568A@TYRPR01MB12156.jpnprd01.prod.outlook.com>
 <CAEvyyTjPWfvJLn3c_G_zLRffZ3=YqzMYj6c5znaNxpHyZAg3XQ@mail.gmail.com>
 <10868918-cdf9-49dc-99af-8e8ccd6e368c@gmail.com>
In-Reply-To: <10868918-cdf9-49dc-99af-8e8ccd6e368c@gmail.com>
From: lakshmi <lakshmigcdac@gmail.com>
Date: Wed, 18 Mar 2026 16:07:01 +0530
Message-ID: 
 <CAEvyyTircZ-tHgap=J6Aog0CBgXp4Dqx6dHYyK1iqgfoT+8D_A@mail.gmail.com>
Subject: Re: parallel data loading for pgbench -i
To: Mircea Cadariu <cadariu.mircea@gmail.com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 "tomas@vondra.me" <tomas@vondra.me>
Content-Type: multipart/alternative; boundary="0000000000001d5dd2064d49fc32"
Archived-At: 
 <https://www.postgresql.org/message-id/CAEvyyTircZ-tHgap%3DJ6Aog0CBgXp4Dqx6dHYyK1iqgfoT%2B8D_A%40mail.gmail.com>
Precedence: bulk

--0000000000001d5dd2064d49fc32
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Mircea, Hayato,

Thanks for the updated v2 patches.

I applied 0001 and 0002 on 19devel and ran some tests. The results look
consistent.

For scale 100, parallel loading speeds up data generation, but in the
non-partitioned case, the VACUUM phase becomes noticeably slower. In
contrast, the partitioned + parallel case performs best overall with much
lower vacuum cost.

For scale 500, I see the same pattern: non-partitioned parallel runs are
dominated by VACUUM time, while the partitioned setup shows a clear overall
speedup.

I also verified correctness, and row counts match expected values.
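
For reference, the expected values follow from the standard pgbench sizes:
each scale unit adds 1 row to pgbench_branches, 10 rows to pgbench_tellers
and 100,000 rows to pgbench_accounts, and pgbench_history is empty after
initialization. Below is a minimal sketch of that arithmetic. The helper name
expected_row_counts is hypothetical and only illustrates the check; the
results would be compared against SELECT count(*) on each table.

```python
# Rows created per scale unit by pgbench -i (standard pgbench table sizes).
ROWS_PER_SCALE = {
    "pgbench_branches": 1,
    "pgbench_tellers": 10,
    "pgbench_accounts": 100_000,
}


def expected_row_counts(scale: int) -> dict[str, int]:
    """Return the expected row count per table after `pgbench -i -s <scale>`."""
    if scale < 1:
        raise ValueError("scale must be >= 1")
    counts = {table: per_unit * scale for table, per_unit in ROWS_PER_SCALE.items()}
    # pgbench_history is only filled by benchmark runs, not by initialization.
    counts["pgbench_history"] = 0
    return counts


# The two scales used in the tests above.
for scale in (100, 500):
    print(scale, expected_row_counts(scale))
```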

So overall, the benefit of parallel loading is much clearer in the
partitioned case.
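
To put a number on that, here is a small sketch that computes the overall
speedup over master from the scale 100 totals Mircea posted (quoted below).
The figures are copied from his mail, not from my runs, and the labels follow
his headings. With his numbers, 0002 comes out at about 1.16x for the
non-partitioned case and about 2.07x for the partitioned case.

```python
# Total "done in" times in seconds from Mircea's quoted results
# (pgbench -i -s 100 -j 10); the key is (build, partitioned).
TOTALS = {
    ("master", False): 20.95,
    ("master", True): 29.73,
    ("0001", False): 18.75,
    ("0001", True): 29.33,
    ("0002", False): 18.12,
    ("0002", True): 14.38,
}


def speedup(build: str, partitioned: bool) -> float:
    """Speedup of `build` over master for the same table layout."""
    return TOTALS[("master", partitioned)] / TOTALS[(build, partitioned)]


for build in ("0001", "0002"):
    for partitioned in (False, True):
        layout = "partitioned" if partitioned else "non-partitioned"
        print(f"{build} {layout}: {speedup(build, partitioned):.2f}x")
```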

I'll try to look further into the VACUUM behavior.

Thanks again for the work on this.

Best regards,
Lakshmi

On Fri, Mar 13, 2026 at 11:59 PM Mircea Cadariu <cadariu.mircea@gmail.com>
wrote:

> Hi Lakshmi, Hayato,
>
>
> Thanks a lot for your input!
>
> I'm not sure why the VACUUM phase takes longer compared to the serial
> run. We can potentially get a clue with a profiler. I know there is an
> ongoing effort to introduce parallel heap vacuum [1] which I expect will
> help with this.
>
> The code comments you have provided me have been applied to the v2 patch
> attached. Below I provide answers to the questions.
>
> > Also, why is -j accepted in case of non-partitions?
> For non-partitioned tables, each worker loads a separate range of rows
> via its own connection in parallel.
>
> > Copying seems to be divided into chunks per COPY_BATCH_SIZE. Is it really
> > essential to parallelize the initialization? I feel it may optimize even
> > serialized case thus can be discussed independently.
> You're right that the COPY batching is an optimization that's
> independent. I wanted to see how fast I can get this patch, so I looked
> for bottlenecks in the new code with a profiler and this was one of
> them. I agree it makes sense to apply this for the serialised case
> separately.
>
> > Per my understanding, each thread creates its tables, and all of them are
> > attached to the parent table. Is it right? I think it needs more code
> > changes, and I am not sure it is critical to make initialization faster.
> Yes, that's correct. Each worker creates its assigned partitions as
> standalone tables, loads data into them, and then the main thread
> attaches them all to the parent after loading completes. It's to avoid
> AccessExclusiveLock contention on the parent table during parallel
> loading and allow each worker to use COPY FREEZE on its standalone table.
>
> > So I suggest using the incremental approach. The first patch only
> > parallelizes
> > the data load, and the second patch implements the CREATE TABLE and
> > ALTER TABLE
> > ATTACH PARTITION. You can benchmark three patterns, master, 0001, and
> > 0001 + 0002, then compare the results. IIUC, this is the common
> > approach to
> > reduce the patch size and make them more reviewable.
>
> Thanks for the recommendation, I extracted 0001 and 0002 as per your
> suggestion. I will see if I can split it more, as indeed it helps with
> the review.
>
> Results are similar to the previous runs.
>
> master
>
> pgbench -i -s 100 -j 10
> done in 20.95 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 14.51 s, vacuum 0.27 s, primary keys 6.16 s).
>
> pgbench -i -s 100 -j 10 --partitions=10
> done in 29.73 s (drop tables 0.00 s, create tables 0.02 s, client-side
> generate 16.33 s, vacuum 8.72 s, primary keys 4.67 s).
>
>
> 0001
> pgbench -i -s 100 -j 10
> done in 18.75 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 6.51 s, vacuum 5.73 s, primary keys 6.50 s).
>
> pgbench -i -s 100 -j 10 --partitions=10
> done in 29.33 s (drop tables 0.00 s, create tables 0.02 s, client-side
> generate 16.48 s, vacuum 7.59 s, primary keys 5.24 s).
>
> 0002
> pgbench -i -s 100 -j 10
> done in 18.12 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 6.64 s, vacuum 5.81 s, primary keys 5.65 s).
>
> pgbench -i -s 100 -j 10 --partitions=10
> done in 14.38 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 7.97 s, vacuum 1.55 s, primary keys 4.85 s).
>
>
> Looking forward to your feedback.
>
> [1]:
> https://www.postgresql.org/message-id/CAD21AoAEfCNv-GgaDheDJ%2Bs-p_Lv1H24AiJeNoPGCmZNSwL1YA%40mail.gmail.com
>
> --
> Thanks,
> Mircea Cadariu
>

--0000000000001d5dd2064d49fc32--