MIME-Version: 1.0
References: 
 <CAL5GnivMgBgRdY9YTLmAQKQa=TQVTRwghiGovK6Q6XxScdGOzg@mail.gmail.com>
 <CANzqJaA6B7XCyqxXFfdZMYTN5GNagHBdgEzbqwcti16N9wfcDA@mail.gmail.com>
 <CAEZv3cpESEGDUu-W5WSDo=LqORjk122YR7UOEdui6ujpTU-eAQ@mail.gmail.com>
 <CANzqJaBLCNnaHiOZdpgAiLgngSmKfbme-ZRot0yvjZcSiYfzHw@mail.gmail.com>
 <CAEZv3cp7bi_HXbi=NSgdcbtM7dX6rKzB57jwqNxo_76eExFJ5w@mail.gmail.com>
 <CANzqJaBygDCVnip18DJ_NKhwSewoJ=q7x3hRD3ev5Jj1y0wbQA@mail.gmail.com>
In-Reply-To: 
 <CANzqJaBygDCVnip18DJ_NKhwSewoJ=q7x3hRD3ev5Jj1y0wbQA@mail.gmail.com>
From: Motog Plus <mplus7535@gmail.com>
Date: Mon, 2 Jun 2025 18:25:32 +0530
Message-ID: 
 <CAL5Gnivs1MKypYvrOGFLyk73KG8wmg1qAint=pkMdTkCGBEXMQ@mail.gmail.com>
Subject: Re: Seeking Suggestions for Best Practices: Archiving and Migrating
 Historical Data in PostgreSQL
To: Ron Johnson <ronljohnsonjr@gmail.com>,
 Pgsql-admin <pgsql-admin@lists.postgresql.org>
Cc: Andy Hartman <hartman60home@gmail.com>
Content-Type: multipart/alternative; boundary="0000000000002492080636964816"
Archived-At: 
 <https://www.postgresql.org/message-id/CAL5Gnivs1MKypYvrOGFLyk73KG8wmg1qAint%3DpkMdTkCGBEXMQ%40mail.gmail.com>
Precedence: bulk

--0000000000002492080636964816
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Everyone,

Thank you all for the helpful suggestions, insights, and follow-up
questions. I truly appreciate the time and effort you’ve taken to share
your experiences and recommendations.

To answer one of the common questions: **yes, we are using partitioned
tables**, primarily based on a timestamp column. This setup is already
helping us manage and isolate historical data more effectively.

The input from this community has been incredibly valuable in helping us
shape our archival approach. We’re currently evaluating a few options based
on your feedback and will proceed with a solution that best balances
efficiency, reliability, and security.

We may reach out again with more specific questions or for further
suggestions once we finalize the approach and start implementation.

Thanks again for your support!

Best regards,
Ramzy

On Sat, May 31, 2025, 01:01 Ron Johnson <ronljohnsonjr@gmail.com> wrote:

> That's an unanswerable question, as I would not use Windows.  😁
>
> Seriously though, since it's an image-heavy database full of PDF and TIFF
> files, I'd do what I did on Linux when I needed to migrate/upgrade a 6TB
> (including indices) db from PG 9.6 to PG 14, which took four hours:
> pg_dump -Z1 --jobs=16
>
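For readers following along, the one-line pg_dump invocation above can be expanded into a small sketch. It assumes the directory output format (-Fd), which pg_dump requires when --jobs is used, and it uses hypothetical names (database "appdb", dump directory "/backup/appdb.dir") that do not come from the thread. The functions only print the command lines, so nothing is executed.

```shell
#!/usr/bin/env bash
# Sketch only: "appdb" and "/backup/appdb.dir" are hypothetical names.

# Build a parallel pg_dump command line.
#   -Fd  directory format (required for --jobs)
#   -Z1  compression level 1 (fast, light compression)
build_dump_cmd() {
    local db="$1" outdir="$2" jobs="${3:-16}"
    printf 'pg_dump -Fd -Z1 --jobs=%s -f %s %s\n' "$jobs" "$outdir" "$db"
}

# Build the matching parallel restore; pg_restore also accepts --jobs
# for directory-format dumps.
build_restore_cmd() {
    local db="$1" indir="$2" jobs="${3:-16}"
    printf 'pg_restore --jobs=%s -d %s %s\n' "$jobs" "$db" "$indir"
}

# Dry run: print the commands instead of running them.
build_dump_cmd appdb /backup/appdb.dir
build_restore_cmd appdb /backup/appdb.dir
```

The job count of 16 is simply the value quoted above; a sensible number depends on the cores and I/O available on the host.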
>
> On Fri, May 30, 2025 at 2:39 PM Andy Hartman <hartman60home@gmail.com>
> wrote:
>
>> What would you use for backup if PG is hosted on Windows?
>>
>> On Fri, May 30, 2025 at 2:10 PM Ron Johnson <ronljohnsonjr@gmail.com>
>> wrote:
>>
>>> Hmm... that was a few years ago, back when v12 was new.  It took about
>>> a month (mainly because they didn't want me running exports during "office
>>> hours").
>>>
>>> There were 120 INSERT & SELECT (no UPDATE or DELETE) tables, so I was
>>> able to add indices on date columns and create by-month views.  (We migrated
>>> the dozen or so *relatively* small UPDATE tables on cut-over day.  On
>>> that same day, I migrated the current month's and the previous month's data
>>> in those 120 tables.)
>>>
>>> I made separate cron jobs to:
>>> - export views from Oracle into COPY-style tab-separated flat files,
>>> - lz4-compress views that had finished exporting, and
>>> - scp files that were finished compressing, to an AWS EC2 VM.
>>>
>>> These jobs pipelined, so there was always a job exporting, always a job
>>> ready to compress tsv files, and another job ready to scp the lz4 files.
>>> When there was nothing for a step to do, the job would sleep for a couple
>>> of minutes, then check if there was more work to do.
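One way to picture a single stage of the pipeline described above is the sketch below. It is not Ron's actual script: the ".done" marker-file convention, the function names and the directory layout are all assumptions made for illustration. Each stage picks up files whose marker exists, runs one step on them, and sleeps when there is nothing to do.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed convention: a producer writes "name.tsv" and, once it
# has finished, creates an empty "name.tsv.done" marker next to it.

# Run "$@" on every finished file in a directory; print how many were handled.
# The step command should not write to stdout, so the count stays clean.
process_ready() {
    local dir="$1" suffix="$2"; shift 2
    local marker file count=0
    for marker in "$dir"/*"$suffix".done; do
        [ -e "$marker" ] || continue      # glob matched nothing
        file="${marker%.done}"
        "$@" "$file" || return 1          # check those return codes
        rm -f "$marker"
        count=$((count + 1))
    done
    echo "$count"
}

# Hypothetical compress step: lz4 the file, drop the source, mark the result.
compress_and_mark() {
    lz4 -q --rm "$1" "$1.lz4" && : > "$1.lz4.done"
}

# One long-running stage: work when there is work, otherwise sleep and re-check.
run_stage() {
    local dir="$1" suffix="$2" nap="$3"; shift 3
    local n
    while true; do
        n=$(process_ready "$dir" "$suffix" "$@") || return 1
        [ "$n" -gt 0 ] || sleep "$nap"
    done
}

# Example (not executed here): compress finished exports, polling every 120 s.
#   run_stage /export .tsv 120 compress_and_mark
```

A second stage with the same shape could watch for ".lz4.done" markers and scp the files, which gives the export, compress and transfer overlap described above.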
>>>
>>> On the AWS EC2 VM, a different cron job waited for files to finish
>>> transferring, then loaded them into the correct table. Just like with the
>>> source host jobs, the "load" job would sleep a bit and then check for more
>>> work. I applied the indices manually.
>>>
>>> The AWS RDS PG12 database was about 4TB.  Snapshots were handled by
>>> AWS.  If this had been one of my on-prem systems, I'd have used
>>> pgbackrest.  (pgbackrest is impressively fast: takes good advantage of PG's
>>> 1GB file max, and globs "small" files into one big file.)
>>>
>>> On Fri, May 30, 2025 at 12:15 PM Andy Hartman <hartman60home@gmail.com>
>>> wrote:
>>>
>>>> What was the duration, start to finish, of the migration of the 6TB of
>>>> data? And what do you use for a quick backup after archiving PG data?
>>>>
>>>> Thanks.
>>>>
>>>> On Fri, May 30, 2025 at 11:29 AM Ron Johnson <ronljohnsonjr@gmail.com>
>>>> wrote:
>>>>
>>>>> On Fri, May 30, 2025 at 3:51 AM Motog Plus <mplus7535@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> We are currently planning a data archival initiative for our
>>>>>> production PostgreSQL databases and would appreciate suggestions or
>>>>>> insights from the community regarding best practices and proven approaches.
>>>>>>
>>>>>> **Scenario:**
>>>>>> - We have a few large tables (several hundred million rows) where we
>>>>>> want to archive historical data (e.g., older than 1 year).
>>>>>> - The archived data should be moved to a separate PostgreSQL database
>>>>>> (on the same or a different server).
>>>>>> - Our goals are: efficient data movement, minimal downtime, and safe
>>>>>> deletion from the source after successful archival.
>>>>>>
>>>>>> - PostgreSQL version: 15.12
>>>>>> - Both source and target databases are PostgreSQL.
>>>>>>
>>>>>> We explored using `COPY TO` and `COPY FROM` with CSV files, uploaded
>>>>>> to a SharePoint or similar storage system. However, our infrastructure team
>>>>>> raised concerns around the computational load of large CSV processing and
>>>>>> potential security implications with file transfers.
>>>>>>
>>>>>> We’d like to understand:
>>>>>> - What approaches have worked well for you in practice?
>>>>>>
>>>>>
>>>>> This is how I migrated 6TB of data from an Oracle database to
>>>>> PostgreSQL, and then implemented quarterly archiving of the PG database:
>>>>> - COPY (SELECT * FROM live_table WHERE date_fld in
>>>>> some_manageable_date_range) TO STDOUT.
>>>>> - Compress
>>>>> - scp
>>>>> - COPY archive_table FROM STDIN.
>>>>> - Index
>>>>> - DELETE FROM live_table WHERE date_fld in some_manageable_date_range
>>>>> (This I only did in the PG archive process.)
>>>>>
>>>>> (Naturally, the Oracle migration used Oracle-specific commands.)
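The archive steps listed above can be sketched as three small SQL builders. This is an illustration under assumptions, not the original scripts: the table and column names (live_table, archive_table, date_fld), the half-open date range and the database and host names in the comments are all hypothetical. The functions only print SQL, so the sketch runs without a database.

```shell
#!/usr/bin/env bash
# Sketch only: all table, column, database and host names are hypothetical.
# Date ranges are half-open: from <= date_fld < to.

# Export one bite-sized date range from the live table.
export_sql() {
    local table="$1" col="$2" from="$3" to="$4"
    printf "COPY (SELECT * FROM %s WHERE %s >= '%s' AND %s < '%s') TO STDOUT\n" \
        "$table" "$col" "$from" "$col" "$to"
}

# Load the same rows into the archive table.
load_sql() {
    printf 'COPY %s FROM STDIN\n' "$1"
}

# Delete the range from the live table, only after the load has succeeded.
delete_sql() {
    local table="$1" col="$2" from="$3" to="$4"
    printf "DELETE FROM %s WHERE %s >= '%s' AND %s < '%s'\n" \
        "$table" "$col" "$from" "$col" "$to"
}

# How the pieces might be chained (not executed here):
#   psql -X -v ON_ERROR_STOP=1 -d livedb \
#        -c "$(export_sql live_table date_fld 2024-01-01 2024-02-01)" | lz4 -c > chunk.tsv.lz4
#   scp chunk.tsv.lz4 archivehost:/incoming/
#   lz4 -dc chunk.tsv.lz4 | psql -X -v ON_ERROR_STOP=1 -d archivedb -c "$(load_sql archive_table)"
#   psql -X -v ON_ERROR_STOP=1 -d livedb \
#        -c "$(delete_sql live_table date_fld 2024-01-01 2024-02-01)"

export_sql live_table date_fld 2024-01-01 2024-02-01
load_sql archive_table
delete_sql live_table date_fld 2024-01-01 2024-02-01
```

Using the same range arguments for the export and the delete keeps the two statements in step, so only rows that were exported are removed.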
>>>>>
>>>>>> - Are there specific tools or strategies you’d recommend for ongoing
>>>>>> archival?
>>>>>>
>>>>>
>>>>> I write generic bash loops to which you pass an array that contains
>>>>> the table name, PK, date column and date range.
>>>>>
>>>>> Given a list of tables, the first script did the COPY ... TO STDOUT
>>>>> export, lz4 and scp.  Once that finished successfully, another script
>>>>> dropped archive indices on the current table, then ran the COPY ... FROM
>>>>> load and the CREATE INDEX statements.  A third script did the deletes.
>>>>>
>>>>> This works even when the live database tables are all connected via
>>>>> FK.  You just need to carefully order the tables in your script.
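A generic loop of the kind described above might look like the following sketch. The "table|pk|date_column|from|to" entry format, the table names and the step functions are assumptions for illustration, not the original code. The loop stops at the first failing step, and the array is ordered so that child tables come before the parents they reference.

```shell
#!/usr/bin/env bash
# Sketch only: entry format and table names are hypothetical.
# Each entry: table|pk|date_column|from|to
# Child tables are listed before their parents so deletes respect the FKs.
TABLES=(
    "order_items|item_id|created_at|2024-01-01|2024-04-01"
    "orders|order_id|created_at|2024-01-01|2024-04-01"
)

# Call "$step table pk col from to" for every entry, stopping on first failure.
for_each_table() {
    local step="$1"; shift
    local spec table pk col from to
    for spec in "$@"; do
        IFS='|' read -r table pk col from to <<< "$spec"
        if ! "$step" "$table" "$pk" "$col" "$from" "$to"; then
            echo "step '$step' failed for $table; stopping" >&2
            return 1
        fi
    done
}

# Placeholder step that only reports what it would work on.
show_step() {
    echo "$1 pk=$2 where $3 in [$4, $5)"
}

for_each_table show_step "${TABLES[@]}"
```

Real export, load and delete steps would replace show_step; because for_each_table returns non-zero on the first failure, a calling script can refuse to start the deletes unless the export and load passes both succeeded.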
>>>>>
>>>>>
>>>>>> - Any performance or consistency issues we should watch out for?
>>>>>>
>>>>>
>>>>> My rules for scripting are "bite-sized pieces" and "check those return
>>>>> codes!".
>>>>>
>>>>>
>>>>>> Your insights or any relevant documentation/pointers would be
>>>>>> immensely helpful.
>>>>>>
>>>>>
>>>>> Index support uber alles.  When deleting from a table which relies on
>>>>> a foreign key link to a table which _does_ have a date field, don't
>>>>> hesitate to join on that table.
>>>>>
>>>>> And DELETE of bite-sized chunks is faster than people give it credit
>>>>> for.
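The two points above (join to the table that has the date field, and delete in bite-sized chunks) can be sketched together. The table and column names and the monthly boundaries are hypothetical, and the DELETE ... USING form is one PostgreSQL way to express the join, not necessarily the one used in the original scripts. The functions only print SQL.

```shell
#!/usr/bin/env bash
# Sketch only: table/column names and date boundaries are hypothetical.

# One DELETE for a child table that has no date column of its own:
# join to the parent through the FK and filter on the parent's date column.
delete_child_sql() {
    local child="$1" fk="$2" parent="$3" pk="$4" col="$5" from="$6" to="$7"
    printf "DELETE FROM %s c USING %s p WHERE c.%s = p.%s AND p.%s >= '%s' AND p.%s < '%s';\n" \
        "$child" "$parent" "$fk" "$pk" "$col" "$from" "$col" "$to"
}

# Turn a list of boundary dates into one DELETE per consecutive pair,
# so each statement touches only one chunk.
chunked_deletes() {
    local child="$1" fk="$2" parent="$3" pk="$4" col="$5"; shift 5
    local from="$1"; shift
    local to
    for to in "$@"; do
        delete_child_sql "$child" "$fk" "$parent" "$pk" "$col" "$from" "$to"
        from="$to"
    done
}

# Dry run: two monthly chunks.
chunked_deletes order_items order_id orders order_id created_at \
    2024-01-01 2024-02-01 2024-03-01

# To execute, the output could be piped to psql, which runs each statement
# in its own transaction by default and stops at the first error with:
#   ... | psql -X -v ON_ERROR_STOP=1 -d livedb
```

An index on the child's FK column and on the parent's date column is what keeps each chunk cheap, which is the "index support" point made above.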
>>>>>
>>>>> --
>>>>> Death to <Redacted>, and butter sauce.
>>>>> Don't boil me, I'm still alive.
>>>>> <Redacted> lobster!
>>>>>
>>>>
>>>
>>
>

--0000000000002492080636964816--