MIME-Version: 1.0
References: 
 <CAL5GnivMgBgRdY9YTLmAQKQa=TQVTRwghiGovK6Q6XxScdGOzg@mail.gmail.com>
 <CANzqJaA6B7XCyqxXFfdZMYTN5GNagHBdgEzbqwcti16N9wfcDA@mail.gmail.com>
 <CAEZv3cpESEGDUu-W5WSDo=LqORjk122YR7UOEdui6ujpTU-eAQ@mail.gmail.com>
 <CANzqJaBLCNnaHiOZdpgAiLgngSmKfbme-ZRot0yvjZcSiYfzHw@mail.gmail.com>
 <CAEZv3cp7bi_HXbi=NSgdcbtM7dX6rKzB57jwqNxo_76eExFJ5w@mail.gmail.com>
In-Reply-To: 
 <CAEZv3cp7bi_HXbi=NSgdcbtM7dX6rKzB57jwqNxo_76eExFJ5w@mail.gmail.com>
From: Ron Johnson <ronljohnsonjr@gmail.com>
Date: Fri, 30 May 2025 15:31:01 -0400
Message-ID: 
 <CANzqJaBygDCVnip18DJ_NKhwSewoJ=q7x3hRD3ev5Jj1y0wbQA@mail.gmail.com>
Subject: Re: Seeking Suggestions for Best Practices: Archiving and Migrating
 Historical Data in PostgreSQL
To: Andy Hartman <hartman60home@gmail.com>
Cc: Pgsql-admin <pgsql-admin@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000fca5c906365f74d5"
Archived-At: 
 <https://www.postgresql.org/message-id/CANzqJaBygDCVnip18DJ_NKhwSewoJ%3Dq7x3hRD3ev5Jj1y0wbQA%40mail.gmail.com>
Precedence: bulk

--000000000000fca5c906365f74d5
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

That's an unanswerable question, as I would not use Windows.  😁

Seriously though, since it's an image-heavy database full of PDF and TIFF
files, I'd do what I did on Linux when I needed to migrate/upgrade a 6TB
(including indices) db from PG 9.6 to PG 14, which took four hours:
pg_dump -Fd -Z1 --jobs=16
(--jobs only works with the directory format, hence -Fd.)
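
A minimal sketch of how that dump and the matching parallel restore might be
invoked. The database names, the dump directory and the restore step are
assumptions for illustration, not what was actually run; the functions only
print the command lines so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical names: "olddb", "newdb" and /backup/olddb.dir are placeholders.

# dump_cmd SOURCE_DB DUMP_DIR JOBS: print a parallel, lightly compressed dump command.
dump_cmd() {
    printf 'pg_dump --format=directory --compress=1 --jobs=%s --file=%s %s\n' "$3" "$2" "$1"
}

# restore_cmd TARGET_DB DUMP_DIR JOBS: print the matching parallel restore command.
restore_cmd() {
    printf 'pg_restore --jobs=%s --dbname=%s %s\n' "$3" "$1" "$2"
}

dump_cmd olddb /backup/olddb.dir 16
restore_cmd newdb /backup/olddb.dir 16
```

Already-compressed PDF and TIFF data gains little from heavier compression,
which is presumably why a low level such as -Z1 is a reasonable trade-off.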


On Fri, May 30, 2025 at 2:39 PM Andy Hartman <hartman60home@gmail.com>
wrote:

> What would you use for backup if PG is hosted on Windows?
>
> On Fri, May 30, 2025 at 2:10 PM Ron Johnson <ronljohnsonjr@gmail.com>
> wrote:
>
>> Hmm... that was a few years ago, back when v12 was new.  It took about a
>> month (mainly because they didn't want me running exports during "office
>> hours").
>>
>> There were 120 INSERT & SELECT (no UPDATE or DELETE) tables, so I was
>> able to add indices on date columns and create by-month views.  (We migrated
>> the dozen or so *relatively* small UPDATE tables on cut-over day.  On
>> that same day, I migrated the current month's and the previous month's data
>> in those 120 tables.)
>>
>> I made separate cron jobs to:
>> - export views from Oracle into COPY-style tab-separated flat files,
>> - lz4-compress views that had finished exporting, and
>> - scp files that were finished compressing, to an AWS EC2 VM.
>>
>> These jobs pipelined, so there was always a job exporting, always a job
>> ready to compress tsv files, and another job ready to scp the lz4 files.
>> When there was nothing for a step to do, the job would sleep for a couple
>> of minutes, then check if there was more work to do.
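
One way such a polling stage could look, as a rough sketch only. The function
name, the suffix convention and the idle limit are all assumptions; the
original jobs were cron-driven and their code is not shown in this thread.

```shell
#!/bin/sh
# run_stage DIR SUFFIX HANDLER IDLE_LIMIT SLEEP_SECS   (all names hypothetical)
# Calls HANDLER on every file in DIR whose name ends in SUFFIX. When nothing
# is waiting it sleeps and polls again, giving up after IDLE_LIMIT empty polls
# in a row. HANDLER must rename or remove its input, or it will be picked up
# again on the next pass.
run_stage() {
    dir=$1 suffix=$2 handler=$3 idle_limit=$4 sleep_secs=$5
    idle=0
    while [ "$idle" -lt "$idle_limit" ]; do
        found=0
        for f in "$dir"/*"$suffix"; do
            [ -e "$f" ] || continue        # the glob matched nothing
            found=1
            "$handler" "$f" || return 1    # stop at the first failure
        done
        if [ "$found" -eq 1 ]; then
            idle=0
        else
            idle=$((idle + 1))
            sleep "$sleep_secs"
        fi
    done
}

# Example: a compress stage whose handler calls lz4 and removes the .tsv,
# polling every two minutes:
#   run_stage /export .tsv compress_one 30 120
```

Each stage (export, compress, scp) would be one such loop with its own
suffix, so a file only becomes visible to the next stage once it is complete.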
>>
>> On the AWS EC2 VM, a different cron job waited for files to finish
>> transferring, then loaded them into the correct table. Just like with the
>> source host jobs, the "load" job would sleep a bit and then check for more
>> work. I applied the indices manually.
>>
>> The AWS RDS PG12 database was about 4TB.  Snapshots were handled by AWS.
>> If this had been one of my on-prem systems, I'd have used pgbackrest.
>> (pgbackrest is impressively fast: takes good advantage of PG's 1GB file
>> max, and globs "small" files into one big file.)
>>
>> On Fri, May 30, 2025 at 12:15 PM Andy Hartman <hartman60home@gmail.com>
>> wrote:
>>
>>> What was the duration, start to finish, of the migration of the 6TB of
>>> data? And what do you use for a quick backup after archiving the PG data?
>>>
>>> Thanks.
>>>
>>> On Fri, May 30, 2025 at 11:29 AM Ron Johnson <ronljohnsonjr@gmail.com>
>>> wrote:
>>>
>>>> On Fri, May 30, 2025 at 3:51 AM Motog Plus <mplus7535@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> We are currently planning a data archival initiative for our
>>>>> production PostgreSQL databases and would appreciate suggestions or
>>>>> insights from the community regarding best practices and proven
>>>>> approaches.
>>>>>
>>>>> **Scenario:**
>>>>> - We have a few large tables (several hundred million rows) where we
>>>>> want to archive historical data (e.g., older than 1 year).
>>>>> - The archived data should be moved to a separate PostgreSQL database
>>>>> (on the same or a different server).
>>>>> - Our goals are: efficient data movement, minimal downtime, and safe
>>>>> deletion from the source after successful archival.
>>>>>
>>>>> - PostgreSQL version: 15.12
>>>>> - Both source and target databases are PostgreSQL.
>>>>>
>>>>> We explored using `COPY TO` and `COPY FROM` with CSV files, uploaded
>>>>> to a SharePoint or similar storage system. However, our infrastructure
>>>>> team raised concerns around the computational load of large CSV
>>>>> processing and potential security implications with file transfers.
>>>>>
>>>>> We’d like to understand:
>>>>> - What approaches have worked well for you in practice?
>>>>>
>>>>
>>>> This is how I migrated 6TB of data from an Oracle database to
>>>> PostgreSQL, and then implemented quarterly archiving of the PG database:
>>>> - COPY (SELECT * FROM live_table WHERE date_fld in
>>>> some_manageable_date_range) TO STDOUT.
>>>> - Compress
>>>> - scp
>>>> - COPY archive_table FROM STDIN.
>>>> - Index
>>>> - DELETE FROM live_table WHERE date_fld in some_manageable_date_range
>>>> (This I only did in the PG archive process.)
>>>>
>>>> (Naturally, the Oracle migration used Oracle-specific commands.)
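
The export, load and delete steps listed above can be sketched as three small
SQL builders. The table and column names, the half-open date range and the
psql/lz4 plumbing in the comments are illustrative assumptions, not the
original scripts.

```shell
#!/bin/sh
# All arguments are hypothetical: TABLE DATE_COL START END, with END exclusive.

# export_sql: the statement whose output gets compressed and shipped.
export_sql() {
    printf "COPY (SELECT * FROM %s WHERE %s >= '%s' AND %s < '%s') TO STDOUT" \
        "$1" "$2" "$3" "$2" "$4"
}

# load_sql ARCHIVE_TABLE: the statement that reads the shipped rows back in.
load_sql() {
    printf 'COPY %s FROM STDIN' "$1"
}

# delete_sql: run only after the load has been verified.
delete_sql() {
    printf "DELETE FROM %s WHERE %s >= '%s' AND %s < '%s'" \
        "$1" "$2" "$3" "$2" "$4"
}

# Possible plumbing (database names are placeholders):
#   psql -X -d livedb -c "$(export_sql orders order_date 2024-01-01 2024-02-01)" | lz4 > orders.tsv.lz4
#   lz4 -dc orders.tsv.lz4 | psql -X -d archivedb -c "$(load_sql orders_archive)"
export_sql orders order_date 2024-01-01 2024-02-01; echo
load_sql orders_archive; echo
delete_sql orders order_date 2024-01-01 2024-02-01; echo
```

Using the same range expression for the export and the delete keeps the two
steps from drifting apart.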
>>>>
>>>>> - Are there specific tools or strategies you’d recommend for ongoing
>>>>> archival?
>>>>>
>>>>
>>>> I write generic bash loops to which you pass an array that contains the
>>>> table name, PK, date column and date range.
>>>>
>>>> Given a list of tables, one script did the COPY ... TO STDOUT, lz4 and
>>>> scp.  Once that finished successfully, another script dropped the archive
>>>> indices for the current table, then ran the COPY ... FROM and CREATE INDEX
>>>> statements.  A third script did the deletes.
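
A rough sketch of such a generic loop, written for plain sh with one spec line
per table instead of a bash array. The field layout, function names and table
names are assumptions for illustration.

```shell
#!/bin/sh
# archive_tables STEP   (hypothetical helper)
# Reads "table pk date_col start end" lines from stdin and calls STEP with
# those five fields, stopping at the first table that fails.
archive_tables() {
    step=$1
    while read -r table pk date_col start end; do
        case $table in ''|'#'*) continue ;; esac   # skip blank and comment lines
        # </dev/null keeps STEP from swallowing the remaining spec lines.
        "$step" "$table" "$pk" "$date_col" "$start" "$end" </dev/null || {
            echo "failed on $table" >&2
            return 1
        }
    done
}

# Example step that only reports what it would do.
show_step() { echo "would archive $1 on $3 from $4 to $5"; }

archive_tables show_step <<'EOF'
# table    pk          date_col    start       end
orders     order_id    order_date  2024-01-01  2024-04-01
invoices   invoice_id  created     2024-01-01  2024-04-01
EOF
```

The same spec file can then drive the export script, the load script and the
delete script in turn.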
>>>>
>>>> This works even when the live database tables are all connected via
>>>> FK.  You just need to carefully order the tables in your script.
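
For the table ordering, one option is to let tsort derive it from the foreign
key pairs rather than maintaining it by hand. This is only a sketch; the
table names are hypothetical, and the pair list would have to be kept in sync
with the real schema.

```shell
#!/bin/sh
# One "parent child" pair per line: the child has an FK to the parent.
fk_pairs='customers orders
orders order_items'

# Load (and archive) parents before their children.
load_order() {
    printf '%s\n' "$fk_pairs" | tsort
}

# Delete children before their parents: same graph with the edges reversed.
delete_order() {
    printf '%s\n' "$fk_pairs" | awk '{ print $2, $1 }' | tsort
}

echo "load:   $(load_order | tr '\n' ' ')"
echo "delete: $(delete_order | tr '\n' ' ')"
```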
>>>>
>>>>
>>>>> - Any performance or consistency issues we should watch out for?
>>>>>
>>>>
>>>> My rules for scripting are "bite-sized pieces" and "check those return
>>>> codes!".
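
As an illustration of both rules, here is a small wrapper that runs one step,
checks its return code, and only publishes the output file when the step
succeeded. The name run_to_file and the .part suffix are assumptions.

```shell
#!/bin/sh
# run_to_file OUTFILE CMD [ARGS...]   (hypothetical helper)
# Runs CMD with stdout going to OUTFILE.part. On success the file is renamed
# to OUTFILE; on failure the partial file is removed and CMD's status returned.
run_to_file() {
    out=$1
    shift
    if "$@" > "$out.part"; then
        mv "$out.part" "$out"
    else
        rc=$?
        rm -f "$out.part"
        echo "step failed (rc=$rc): $*" >&2
        return "$rc"
    fi
}

# Example (paths and query are placeholders):
#   run_to_file /export/orders_2024_01.tsv psql -X -d livedb -c "COPY (...) TO STDOUT" || exit 1
```

Because the final name only appears after a clean exit, a later stage never
picks up a half-written file.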
>>>>
>>>>
>>>>> Your insights or any relevant documentation/pointers would be
>>>>> immensely helpful.
>>>>>
>>>>
>>>> Index support uber alles.  When deleting from a table which relies on a
>>>> foreign key link to a table which _does_ have a date field, don't hesitate
>>>> to join on that table.
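
In PostgreSQL that join can be written with DELETE ... USING. A sketch of a
builder for such a statement follows; the table and column names and the
half-open date range are assumptions, and both the FK column and the parent's
date column should be indexed for this to perform well.

```shell
#!/bin/sh
# delete_via_parent_sql CHILD FK_COL PARENT PARENT_PK DATE_COL START END
# (hypothetical helper; END is exclusive)
delete_via_parent_sql() {
    printf "DELETE FROM %s c USING %s p WHERE c.%s = p.%s AND p.%s >= '%s' AND p.%s < '%s'" \
        "$1" "$3" "$2" "$4" "$5" "$6" "$5" "$7"
}

delete_via_parent_sql order_items order_id orders order_id order_date 2024-01-01 2024-02-01
echo
```

The child rows have to go before the matching parent rows are deleted,
otherwise the date filter has nothing left to join against.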
>>>>
>>>> And DELETE of bite-sized chunks is faster than people give it credit
>>>> for.
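
One way to get bite-sized chunks is to split the archive window into calendar
months and run one DELETE per month. This is a sketch only; the YYYY-MM input
format and the function name are assumptions.

```shell
#!/bin/sh
# month_ranges START_YYYY-MM END_YYYY-MM   (hypothetical helper)
# Prints one "first_day first_day_of_next_month" pair per month, END exclusive.
month_ranges() {
    y=${1%-*} m=${1#*-}
    ey=${2%-*} em=${2#*-}
    m=${m#0} em=${em#0}          # strip a leading zero so 08/09 are not read as octal
    while [ "$y" -lt "$ey" ] || { [ "$y" -eq "$ey" ] && [ "$m" -lt "$em" ]; }; do
        ny=$y nm=$((m + 1))
        if [ "$nm" -gt 12 ]; then nm=1 ny=$((y + 1)); fi
        printf '%04d-%02d-01 %04d-%02d-01\n' "$y" "$m" "$ny" "$nm"
        y=$ny m=$nm
    done
}

# Each output line can feed one DELETE with "date_fld >= start AND date_fld < end".
month_ranges 2024-11 2025-02
```

Smaller chunks mean shorter transactions and less bloat to clean up per
statement, at the cost of more round trips.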
>>>>
>>>> --
>>>> Death to <Redacted>, and butter sauce.
>>>> Don't boil me, I'm still alive.
>>>> <Redacted> lobster!
>>>>
>>>
>>
>

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

--000000000000fca5c906365f74d5--