Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1p4nOr-0007pw-BL for pgsql-sql@arkaria.postgresql.org; Mon, 12 Dec 2022 18:19:37 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.92) (envelope-from ) id 1p4nOo-0000M7-Py for pgsql-sql@arkaria.postgresql.org; Mon, 12 Dec 2022 18:19:34 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1p4nOo-0000KH-BW for pgsql-sql@lists.postgresql.org; Mon, 12 Dec 2022 18:19:34 +0000 Received: from premium22-2.web-hosting.com ([68.65.122.104]) by makus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1p4nOe-0007kt-S4 for pgsql-sql@lists.postgresql.org; Mon, 12 Dec 2022 18:19:32 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=misuse.org; s=default; h=Content-Type:Cc:To:Subject:Message-ID:Date:From:In-Reply-To: References:MIME-Version:Sender:Reply-To:Content-Transfer-Encoding:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=r7GHxUFlR7akjqDpHcTa/ZjqDAUkQeor14J6uFvlUw0=; b=bI4aZ5GIDb+ITv4RufU/iih25S CzeQBukP0YbLXUKdqPulgwx5WFX9DVSPK8owUhYTDwWKrJ+yMJOvTy1Ysk4a8DtBS82frWn9/t0mf CayWJpJkBmAANld57bJL626/mOSxAPVurxtCQjz/WEopUCO98Iat/EFXTr6s5AiZ5iGQFY6OeYz0F S0mKMajk3vNLXtzKD+k4SUZlxP3Px8wILvShBo7jLc6KIJxWEOqyKqzjvDF5h/bhAXWACYxB4v/St twBDvWsp74jNwN0mPImlOOg4GBK6p9a7ZjdTRFjCIU/SAKDCoMZY8b5BWLSoHPgs8VplLZwv3PHUh TRXX2JAw==; Received: from mail-yb1-f179.google.com ([209.85.219.179]:43867) by premium22.web-hosting.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.95) (envelope-from ) id 1p4nOT-008eZf-7i for pgsql-sql@lists.postgresql.org; Mon, 12 Dec 2022 13:19:23 -0500 Received: by mail-yb1-f179.google.com with SMTP id d128so14714309ybf.10 for ; Mon, 12 Dec 2022 10:19:12 -0800 (PST) X-Gm-Message-State: ANoB5pkJ6WUOqi+wUs2I/jdN71ITVq0Wx2JiNMbDsanaxB/+3SkkJZLB TgFaAtg2kPUOlYEIPNy0gWV1ezaIT8i5H9pAkps= X-Google-Smtp-Source: AA0mqf5cAhH6icsbDkikcr3Cvq4yk3Ex95nNy1SlRAmf9OQvteRnVBZHy+/ih8sqMHQ1RawiapbY3oxBTHUL5OzhQLg= X-Received: by 2002:a25:9f85:0:b0:700:f93d:c7cd with SMTP id u5-20020a259f85000000b00700f93dc7cdmr18415271ybq.166.1670869152045; Mon, 12 Dec 2022 10:19:12 -0800 (PST) MIME-Version: 1.0 References: <1e0484b6-40a6-15a9-9890-0991c8b8c1da@gmail.com> <7BD04D8E-9307-4107-B4D0-4C281390F996@gmail.com> In-Reply-To: <7BD04D8E-9307-4107-B4D0-4C281390F996@gmail.com> From: Steve Midgley Date: Mon, 12 Dec 2022 10:19:00 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: PARALLEL CTAS To: Shane Borden Cc: Rob Sargent , pgsql-sql@lists.postgresql.org Content-Type: multipart/alternative; boundary="00000000000046230a05efa58afe" X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - premium22.web-hosting.com X-AntiAbuse: Original Domain - lists.postgresql.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - misuse.org X-Get-Message-Sender-Via: premium22.web-hosting.com: authenticated_id: science@misuse.org X-Authenticated-Sender: premium22.web-hosting.com: science@misuse.org X-Source: X-Source-Args: X-Source-Dir: X-From-Rewrite: unmodified, already matched List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000046230a05efa58afe Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Dec 12, 2022 at 9:36 AM Shane Borden wrote: > I saw this as an option. Let=E2=80=99s say you have a million row table.= If you > copy from STDIN, do you have to spool the source table to a local file > first? Looking to replace a CTAS thats doing this all within SQL > statements now. > --- > > Thanks, > > > Shane Borden > sborden76@gmail.com > > On Dec 12, 2022, at 12:25 PM, Rob Sargent wrote: > > On 12/12/22 10:13, Shane Borden wrote: > > The issue is there are certain performance benefits to be had by doing > parallel CTAS operations and when converting from Oracle to PostgreSQL > switching to a =E2=80=9CCOPY=E2=80=9D operation isn=E2=80=99t feasible. > --- > > Thanks, > > Today I suspect you're left with something like the following: > - CTAS from source where 1=3D2 (i.e. table definition via select semantic= s) > - copy from stdin (filled with intended CTAS select) > > I'm not at all familiar with Oracle / CTAS (fair warning). But I did spen= d some time, years ago, building a fast import using COPY/STDIN in Postgres awhile back. I copied that code into a Ruby language gist demonstrating the core concept recently. The key idea, IMO, is to use an intermediate language like Ruby to manage the "spooling" from the source file to STDIN. That system could presumably fork processes to have more than one COPY from the file going at a time (and coordinate each thread to access different rows of the source file). However, my experience was that if you want the absolutely fastest import, you should copy the source file to a fast disk/array on the Postgres server itself, and then run the COPY command pointing to that file locally, not piping from STDIN over a network connection. Postgres, at least in my testing, maxes I/O on a single COPY command pretty heartily - so much so that you might find performance otherwise becomes a problem, esp if your DBs are on that disk/array. IIRC, I was seeing millions of rows loaded per second on a high performing Postgres server, using this method. Also, IIRC, I removed the indexes from the table, again to max throughput, and then added them back after the COPY completed (not sure if modern Postgres COPY needs that - this was years ago). I think total time to completion was also faster, as the index rebuilds were faster after the fact, than done during the loading (again I'm foggy so maybe inaccurate - this was years ago), but I was loading into a blank table -- rebuilding already constructed indexes might have an overwhelming cost. If you have to do a network based COPY from STDIN, then my gist above will hopefully show the way - and I wouldn't be surprised if running multiple of these simultaneously gets better performance than a single COPY. At that point, you'll just be fine tuning to max saturation on the network bandwidth - not the max capability of the postgres server, which the local disk COPY does. Hopefully this input is relevant - please ignore if I'm missing your goal entirely! Best, Steve --00000000000046230a05efa58afe Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Mon, Dec 12, 2022 at 9:= 36 AM Shane Borden <sborden76@gma= il.com> wrote:
I saw this as an option.=C2=A0 Let=E2=80=99s say you have a mill= ion row table.=C2=A0 If you copy from STDIN, do you have to spool the sourc= e table to a local file first?=C2=A0 Looking to replace a CTAS thats doing = this all within SQL statements now.
---

Thanks,


Shane Borden
sborden76@gmail.com

On Dec 12, 2022, at 12:25 PM, Rob S= argent <robjs= argent@gmail.com> wrote:

On 12/12/22 10:13, Shane= Borden wrote:
The issue is there are certain = performance benefits to be had by doing parallel CTAS operations and when c= onverting from Oracle to PostgreSQL switching to a =E2=80=9CCOPY=E2=80=9D o= peration isn=E2=80=99t feasible.
---

Thanks,

= Today I suspect you're left with something like the following:
- CTA= S from source where 1=3D2 (i.e. table definition via select semantics)
-= copy from stdin (filled with intended CTAS select)

I'm not at all familiar with Ora= cle / CTAS (fair warning). But I did spend some time, years ago, building a= fast import using COPY/STDIN in Postgres awhile back. I copied that code i= nto a Ruby language gist demonstrating the core concept recently.
The key idea, IMO, is to use an intermediate language like = Ruby to manage the "spooling" from the source file to STDIN. That= system could presumably fork processes to have more than one COPY from the= file going at a time (and coordinate each thread to access different rows = of the source file).

However, my experience = was that if you want the absolutely fastest import, you should copy the sou= rce file to a fast disk/array on the Postgres server itself, and then run t= he COPY command pointing to that file locally, not piping from STDIN over a= network connection. Postgres, at least in my testing, maxes I/O on a singl= e COPY command pretty heartily - so much so that you might find performance= otherwise becomes a problem, esp if your DBs are on that disk/array. IIRC,= I was seeing millions of rows loaded per second on a high performing Postg= res server, using this method. Also, IIRC, I removed the indexes from the t= able, again to max throughput, and then added them back after the COPY comp= leted (not sure if modern Postgres COPY needs that - this was years ago).= =C2=A0

I think total time to completion was also f= aster, as the index rebuilds were faster after the fact, than done during t= he loading (again I'm foggy so maybe inaccurate - this was years ago), = but I was loading into a blank table -- rebuilding already constructed inde= xes might have an overwhelming cost.

If you have t= o do a network based COPY from STDIN, then my gist above will hopefully sho= w the way - and I wouldn't be surprised if running multiple of these si= multaneously gets better performance than a single COPY. At that point, you= 'll just be fine tuning to max saturation on the network bandwidth - no= t the max capability of the postgres server, which the local disk COPY does= .

Hopefully this input is relevant - please ignore= if I'm missing your goal entirely!

Best,
Steve
=C2=A0
--00000000000046230a05efa58afe--