MIME-Version: 1.0
References: 
 <CAL5GnivMgBgRdY9YTLmAQKQa=TQVTRwghiGovK6Q6XxScdGOzg@mail.gmail.com>
 <CANzqJaA6B7XCyqxXFfdZMYTN5GNagHBdgEzbqwcti16N9wfcDA@mail.gmail.com>
 <CAEZv3cpESEGDUu-W5WSDo=LqORjk122YR7UOEdui6ujpTU-eAQ@mail.gmail.com>
 <CANzqJaBLCNnaHiOZdpgAiLgngSmKfbme-ZRot0yvjZcSiYfzHw@mail.gmail.com>
In-Reply-To: 
 <CANzqJaBLCNnaHiOZdpgAiLgngSmKfbme-ZRot0yvjZcSiYfzHw@mail.gmail.com>
From: Andy Hartman <hartman60home@gmail.com>
Date: Fri, 30 May 2025 14:39:35 -0400
Message-ID: 
 <CAEZv3cp7bi_HXbi=NSgdcbtM7dX6rKzB57jwqNxo_76eExFJ5w@mail.gmail.com>
Subject: Re: Seeking Suggestions for Best Practices: Archiving and Migrating
 Historical Data in PostgreSQL
To: Ron Johnson <ronljohnsonjr@gmail.com>
Cc: Pgsql-admin <pgsql-admin@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="00000000000003c3fa06365ebdfe"
Archived-At: 
 <https://www.postgresql.org/message-id/CAEZv3cp7bi_HXbi%3DNSgdcbtM7dX6rKzB57jwqNxo_76eExFJ5w%40mail.gmail.com>
Precedence: bulk

--00000000000003c3fa06365ebdfe
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

What would you use for backups if PG is hosted on Windows?

On Fri, May 30, 2025 at 2:10 PM Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

> Hmm... that was a few years ago, back when v12 was new.  It took about a
> month (mainly because they didn't want me running exports during "office
> hours").
>
> There were 120 INSERT & SELECT (no UPDATE or DELETE) tables, so I was able
> to add indices on date columns and create by-month views.  (We migrated the
> dozen or so *relatively* small UPDATE tables on cut-over day.  On that
> same day, I migrated the current month and the previous month's data in
> those 120 tables.)
>
> I made separate cron jobs to:
> - export views from Oracle into COPY-style tab-separated flat files,
> - lz4-compress files that had finished exporting, and
> - scp files that were finished compressing, to an AWS EC2 VM.
>
> These jobs pipelined, so there was always a job exporting, always a job
> ready to compress tsv files, and another job ready to scp the lz4 files.
> When there was nothing for a step to do, the job would sleep for a couple
> of minutes, then check if there was more work to do.
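
A minimal sketch of one stage of the pipelined jobs described above (the
compress step). The spool directory and the `.done` marker-file convention
are assumptions made for illustration; the quoted message does not say how
a finished export was detected.

```shell
#!/usr/bin/env bash
# Hypothetical layout: the export job writes NAME.tsv and, once finished,
# touches NAME.tsv.done. This worker compresses only finished files.

SPOOL_DIR="${SPOOL_DIR:-/var/spool/archive}"   # assumed directory
COMPRESS_CMD="${COMPRESS_CMD:-lz4 -q --rm}"    # --rm removes the source on success

# Compress every finished export once; returns 0 if any work was done.
compress_ready_files() {
    local marker tsv did_work=1
    for marker in "$SPOOL_DIR"/*.tsv.done; do
        [ -e "$marker" ] || continue           # glob matched nothing
        tsv="${marker%.done}"
        # COMPRESS_CMD is left unquoted on purpose so its flags split into words.
        if $COMPRESS_CMD "$tsv"; then
            rm -f "$marker"
            did_work=0
        else
            echo "compress failed for $tsv" >&2
        fi
    done
    return "$did_work"
}

# Poll forever: keep going while there is work, otherwise sleep two minutes.
run_worker() {
    while true; do
        compress_ready_files || sleep 120
    done
}
```

The scp and load stages could follow the same shape, each watching for the
output of the stage before it.
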
>
> On the AWS EC2 VM, a different cron job waited for files to finish
> transferring, then loaded them into the correct table. Just like with the
> source host jobs, the "load" job would sleep a bit and then check for more
> work. I applied the indices manually.
>
> The AWS RDS PG12 database was about 4TB.  Snapshots were handled by AWS.
> If this had been one of my on-prem systems, I'd have used pgbackrest.
> (pgbackrest is impressively fast: takes good advantage of PG's 1GB file
> max, and globs "small" files into one big file.)
>
> On Fri, May 30, 2025 at 12:15 PM Andy Hartman <hartman60home@gmail.com>
> wrote:
>
>> What was the start-to-finish duration of the migration of the 6TB of
>> data? And what do you use for a quick backup after archiving the PG data?
>>
>> Thanks.
>>
>> On Fri, May 30, 2025 at 11:29 AM Ron Johnson <ronljohnsonjr@gmail.com>
>> wrote:
>>
>>> On Fri, May 30, 2025 at 3:51 AM Motog Plus <mplus7535@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> We are currently planning a data archival initiative for our production
>>>> PostgreSQL databases and would appreciate suggestions or insights from the
>>>> community regarding best practices and proven approaches.
>>>>
>>>> **Scenario:**
>>>> - We have a few large tables (several hundred million rows) where we
>>>> want to archive historical data (e.g., older than 1 year).
>>>> - The archived data should be moved to a separate PostgreSQL database
>>>> (on the same or a different server).
>>>> - Our goals are: efficient data movement, minimal downtime, and safe
>>>> deletion from the source after successful archival.
>>>>
>>>> - PostgreSQL version: 15.12
>>>> - Both source and target databases are PostgreSQL.
>>>>
>>>> We explored using `COPY TO` and `COPY FROM` with CSV files, uploaded to
>>>> a SharePoint or similar storage system. However, our infrastructure team
>>>> raised concerns around the computational load of large CSV processing and
>>>> potential security implications with file transfers.
>>>>
>>>> We'd like to understand:
>>>> - What approaches have worked well for you in practice?
>>>>
>>>
>>> This is how I migrated 6TB of data from an Oracle database to
>>> PostgreSQL, and then implemented quarterly archiving of the PG database:
>>> - COPY (SELECT * FROM live_table WHERE date_fld in
>>> some_manageable_date_range) TO STDOUT.
>>> - Compress
>>> - scp
>>> - COPY archive_table FROM STDIN.
>>> - Index
>>> - DELETE FROM live_table WHERE date_fld in some_manageable_date_range
>>> (This I only did in the PG archive process.)
>>>
>>> (Naturally, the Oracle migration used Oracle-specific commands.)
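
The quoted steps could be sketched roughly as follows. This is only an
illustration: the table, column, database, and host names are hypothetical,
and the half-open date range is one assumed way to express
"some_manageable_date_range". The functions just build the SQL text; the
wiring with psql, lz4, and scp is shown in comments and is not executed.

```shell
#!/usr/bin/env bash
# Hypothetical names throughout: live_table, archive_table, date_fld.

# COPY statement that exports one half-open date range to stdout.
export_sql() {   # usage: export_sql TABLE DATE_COLUMN FROM TO
    printf "COPY (SELECT * FROM %s WHERE %s >= DATE '%s' AND %s < DATE '%s') TO STDOUT" \
        "$1" "$2" "$3" "$2" "$4"
}

# COPY statement that loads the same stream on the archive side.
load_sql() {     # usage: load_sql ARCHIVE_TABLE
    printf "COPY %s FROM STDIN" "$1"
}

# DELETE for the same range, to be run only after the load has been verified.
delete_sql() {   # usage: delete_sql TABLE DATE_COLUMN FROM TO
    printf "DELETE FROM %s WHERE %s >= DATE '%s' AND %s < DATE '%s'" \
        "$1" "$2" "$3" "$2" "$4"
}

# How the pieces might be wired together (assumed names, not executed here):
#   psql -X -v ON_ERROR_STOP=1 -d live \
#        -c "$(export_sql live_table date_fld 2024-01-01 2024-02-01)" \
#       | lz4 > live_table_2024-01.tsv.lz4
#   scp live_table_2024-01.tsv.lz4 archive-host:/incoming/
#   lz4 -dc live_table_2024-01.tsv.lz4 \
#       | psql -X -v ON_ERROR_STOP=1 -d archive -c "$(load_sql archive_table)"
```
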
>>>
>>>> - Are there specific tools or strategies you'd recommend for ongoing
>>>> archival?
>>>>
>>>
>>> I write generic bash loops to which you pass an array that contains the
>>> table name, PK, date column and date range.
>>>
>>> Given a list of tables, it did the COPY ... TO STDOUT, lz4 and scp.  Once
>>> that finished successfully, another script dropped the archive indices on
>>> the current table, then ran the COPY ... FROM and CREATE INDEX statements.
>>> A third script did the deletes.
>>>
>>> This works even when the live database tables are all connected via FK.
>>> You just need to carefully order the tables in your script.
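
A rough sketch of such an array-driven loop, under stated assumptions: the
"table|pk|date_column|from|to" record format, the table names, and the
`show_step` placeholder are all invented for illustration. The list is
written parent-first; walking it in reverse for the delete pass is one
plausible way to satisfy foreign keys, not necessarily what the original
scripts did.

```shell
#!/usr/bin/env bash
# Each entry is "table|pk|date_column|from|to" (all names hypothetical).
# Parent tables come first; a delete pass can walk the list in reverse
# so that child rows are removed before the rows they reference.
TABLES=(
    "orders|order_id|order_date|2024-01-01|2024-04-01"
    "order_items|item_id|created_on|2024-01-01|2024-04-01"
)

# Run STEP_CMD once per table spec, stopping at the first non-zero return code.
for_each_table() {   # usage: for_each_table STEP_CMD [forward|reverse]
    local step="$1" direction="${2:-forward}"
    local i spec table pk datecol from to
    local n=${#TABLES[@]}
    for ((i = 0; i < n; i++)); do
        if [ "$direction" = reverse ]; then
            spec="${TABLES[n - 1 - i]}"
        else
            spec="${TABLES[i]}"
        fi
        IFS='|' read -r table pk datecol from to <<< "$spec"
        if ! "$step" "$table" "$pk" "$datecol" "$from" "$to"; then
            echo "step '$step' failed on $table; stopping" >&2
            return 1
        fi
    done
}

# Placeholder step that only prints what it would work on.
show_step() { echo "$1 $3 [$4, $5)"; }
```

Calling `for_each_table show_step` lists the tables in load order, and
`for_each_table show_step reverse` lists them in delete order; a real step
function would run the export, load, or delete for one table and return
non-zero on any failure.
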
>>>
>>>
>>>> - Any performance or consistency issues we should watch out for?
>>>>
>>>
>>> My rules for scripting are "bite-sized pieces" and "check those return
>>> codes!".
>>>
>>>
>>>> Your insights or any relevant documentation/pointers would be immensely
>>>> helpful.
>>>>
>>>
>>> Index support uber alles.  When deleting from a table which relies on a
>>> foreign key link to a table which _does_ have a date field, don't hesitate
>>> to join on that table.
>>>
>>> And DELETE of bite-sized chunks is faster than people give it credit for.
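
One way the two points above might look in practice, as a sketch only: a
date range is split into one-day chunks, and each chunk becomes a small
DELETE that joins the child table to its parent through the foreign key.
The table and column names are hypothetical, the one-day chunk size is an
arbitrary choice, and the date arithmetic assumes GNU `date`.

```shell
#!/usr/bin/env bash
# Print one "from to" pair per day in the half-open range [START, END).
# Uses GNU date; BSD/macOS date takes different flags.
daily_chunks() {   # usage: daily_chunks START END   (YYYY-MM-DD)
    local cur="$1" end="$2" next
    while [[ "$cur" < "$end" ]]; do
        next="$(date -u -d "$cur + 1 day" +%F)"
        echo "$cur $next"
        cur="$next"
    done
}

# DELETE for a child table with no date column of its own: join to the
# parent through the foreign key. Names are hypothetical.
child_delete_sql() {   # usage: child_delete_sql FROM TO
    printf "DELETE FROM order_items c USING orders p WHERE c.order_id = p.order_id AND p.order_date >= DATE '%s' AND p.order_date < DATE '%s';\n" "$1" "$2"
}

# Emit one small DELETE per day; each could be run as its own transaction.
chunked_deletes() {   # usage: chunked_deletes START END
    local from to
    daily_chunks "$1" "$2" | while read -r from to; do
        child_delete_sql "$from" "$to"
    done
}
```

The output of `chunked_deletes` could be fed to psql one statement at a
time, checking the return code after each chunk. An index on the parent's
date column and on the child's foreign key column is what keeps each chunk
cheap.
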
>>>
>>> --
>>> Death to <Redacted>, and butter sauce.
>>> Don't boil me, I'm still alive.
>>> <Redacted> lobster!
>>>
>>
>
> --
> Death to <Redacted>, and butter sauce.
> Don't boil me, I'm still alive.
> <Redacted> lobster!
>

--00000000000003c3fa06365ebdfe--