Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uM4hg-001ypS-M0 for pgsql-admin@arkaria.postgresql.org; Mon, 02 Jun 2025 12:55:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uM4hf-009jWv-De for pgsql-admin@arkaria.postgresql.org; Mon, 02 Jun 2025 12:55:47 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uM4he-009jWn-NT for pgsql-admin@lists.postgresql.org; Mon, 02 Jun 2025 12:55:47 +0000 Received: from mail-pj1-x1036.google.com ([2607:f8b0:4864:20::1036]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uM4hb-00027A-3C for pgsql-admin@lists.postgresql.org; Mon, 02 Jun 2025 12:55:45 +0000 Received: by mail-pj1-x1036.google.com with SMTP id 98e67ed59e1d1-306b6ae4fb2so3628997a91.3 for ; Mon, 02 Jun 2025 05:55:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1748868944; x=1749473744; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=3pejgVKN8uopPP/1rQXiDOqDs66fL8XOuZSkAWLLYa4=; b=K53lFtqXdDEKjfor+iGckgQ/NVzhe+Vbe9XeKVc58BsumB0xB1nbTSe/Wc3q1IyGe+ Wefv0F5PYQS+sVeMV3bQXZ5Lamq5DQvJ/kR0GoIrIyUHboKtv1lUULuzM2NETTyMpQZa NyYmf01JTA605LXNKtGvaOlODpCTbS5tumBI2YB/ni2VPLjddBq+nEgE4eRvaAzbtFhd 7Uaw2fwMlXdIumBcbEruJLyw/3XI66yTmZAgtSpsS0FY2HKbF4JYUDlHrPDzfRIFaCgU WEQLdywUOVLqyq+KgRl09ubOBEtnnuTJmc9Sd2HnOr3qrvBa62xneOCw0s6HrCaeYOK/ 4inA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748868944; x=1749473744; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=3pejgVKN8uopPP/1rQXiDOqDs66fL8XOuZSkAWLLYa4=; b=GYXKKxz3aF/I9s7p587WrzXekBWVCrqRukkIPWhU8j45TOlrnKn/0ticce/y/t2cDG xWleS4uZp1rQHwmvajqC33HdMLtWZHHElOr0gl5hTHmo54An48idUpU83MRPaZFNpWu+ SEyhaWO59WEJan4iWQmlmAx9MINofdn57CY3M4Eqn41JVIxlHlm0BzzMYI4TE/g1llT8 EWdO0dQFQj7dk/pBlWltlzjRndOR7RbB5ACGzBWWjUOaIILX87pg+FbCJBr8F9fys+IQ MENEri2XEGxCcZpnZQZzzKtyFctTBkWVuZ5Otb6CLiBNf3z+3GSYIj/nbP/yOIYz45yF nZXw== X-Forwarded-Encrypted: i=1; AJvYcCWNNdw6dQ6XrooU/juM5tIHmPBIct9/tN3dZA8J7j6hns21klHoheUaNcP+5qbQ13IIj0FGs7Y1hVb/Qg==@lists.postgresql.org X-Gm-Message-State: AOJu0YzNRCaaJjRgA1MOxDbt702o4plLWQA6Zh9Iw2+c/NwSqeXcU89X /Kwg3uFmwt6twH37xWYI1hETtd+Th5z9dFXwrdMfUO+lOTYH5dW4F63Adaz8H6s8touGaYZV1ef LwDNJNMkqVI6wr66jcZHIyjaiYZCxjEI= X-Gm-Gg: ASbGncshyNE7/FJwnSC4s7PYIXDR0VJR56Zv73NENwXQt/uIUaVxEKcSsg0aqI4FWYo ySVjNaj2wTrRV4BXo/g521+r1HIV3udQsf5rhLymRs0zpC6/xQ82I0us4oZOXcWpB2deVaVTIFv bF0bN0FSbG2BosdUD1UxBPzyTL5O5SfTwJUznyU/JgLPtPEw== X-Google-Smtp-Source: AGHT+IEAxXsNn5EumrL7nfZbIFl9+DI4xlATVcMA4rwr1DV5vbbhk6BszyfH88MwXAZL2XDDc9l9kn06K4O5gre49Rg= X-Received: by 2002:a17:90b:4a0d:b0:311:ff02:3fd6 with SMTP id 98e67ed59e1d1-3127c6bb367mr14337936a91.12.1748868943612; Mon, 02 Jun 2025 05:55:43 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Motog Plus Date: Mon, 2 Jun 2025 18:25:32 +0530 X-Gm-Features: AX0GCFt0nyMI1YPFuVdUb4dDa6ANMMNEnyHVWzbphNZ4aWKxIVPTndYZJ2szh-Y Message-ID: Subject: Re: Seeking Suggestions for Best Practices: Archiving and Migrating Historical Data in PostgreSQL To: Ron Johnson , Pgsql-admin Cc: Andy Hartman Content-Type: multipart/alternative; boundary="0000000000002492080636964816" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000002492080636964816 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Everyone, Thank you all for the helpful suggestions, insights, and follow-up questions. I truly appreciate the time and effort you=E2=80=99ve taken to s= hare your experiences and recommendations. To answer one of the common questions: **yes, we are using partitioned tables**, primarily based on a timestamp column. This setup is already helping us manage and isolate historical data more effectively. The input from this community has been incredibly valuable in helping us shape our archival approach. We=E2=80=99re currently evaluating a few optio= ns based on your feedback and will proceed with a solution that best balances efficiency, reliability, and security. We may reach out again with more specific questions or for further suggestions once we finalize the approach and start implementation. Thanks again for your support! Best regards, Ramzy On Sat, May 31, 2025, 01:01 Ron Johnson wrote: > That's an unanswerable question, as I would not use Windows. =F0=9F=98= =81 > > Seriously though, since it's an image-heavy database full of PDF and TIFF > files, I'd do what I did on Linux when needing to migrate/upgrade a 6TB > (including indices) db from PG 9.6 to PG 14, and took four hours: > pg_dump -Z1 --jobs=3D16 > > > On Fri, May 30, 2025 at 2:39=E2=80=AFPM Andy Hartman > wrote: > >> What would you use for backup if PG hosted on Windows >> >> On Fri, May 30, 2025 at 2:10=E2=80=AFPM Ron Johnson >> wrote: >> >>> Hmm... that was a few years ago, back when v12 was new. It took about >>> a month (mainly because they didn't want me running exports during "off= ice >>> hours"). >>> >>> There were 120 INSERT & SELECT (no UPDATE or DELETE) tables, so I was >>> able to add indices on date columns, create by-month views. (We migrat= ed >>> the dozen or so *relatively* small UPDATE tables on cut-over day. On >>> that same day, I migrated the current month and the previous month's da= ta >>> in those 120 tables. >>> >>> I made separate cron jobs to: >>> - export views from Oracle into COPY-style tab-separated flat files, >>> - lz4-compress views that had finished exporting, and >>> - scp files that were finished compressing, to an AWS EC2 VM. >>> >>> These jobs pipelined, so there was always a job exporting, always a job >>> ready to compress tsv files, and another job ready to scp the lz4 files= . >>> When there was nothing for a step to do, the job would sleep for a coup= le >>> of minutes, then check if there was more work to do. >>> >>> On the AWS EC2 VM, a different cron job waited for files to finish >>> transferring, then loaded them into the correct table. Just like with t= he >>> source host jobs, the "load" job would sleep a bit and then check for m= ore >>> work. I manually applied Indices. >>> >>> The AWS RDS PG12 database was about 4TB. Snapshots were handled by >>> AWS. If this had been one of my on-prem systems, I'd have used >>> pgbackrest. (pgbackrest is impressively fast: takes good advantage of = PG's >>> 1GB file max, and globs "small" files into one big file.) >>> >>> On Fri, May 30, 2025 at 12:15=E2=80=AFPM Andy Hartman >>> wrote: >>> >>>> what was the duration start to finish of the migration of the 6tb of >>>> data. then what do you use for a quick backup after archived PG data >>>> >>>> Thanks. >>>> >>>> On Fri, May 30, 2025 at 11:29=E2=80=AFAM Ron Johnson >>>> wrote: >>>> >>>>> On Fri, May 30, 2025 at 3:51=E2=80=AFAM Motog Plus >>>>> wrote: >>>>> >>>>>> Hi Team, >>>>>> >>>>>> We are currently planning a data archival initiative for our >>>>>> production PostgreSQL databases and would appreciate suggestions or >>>>>> insights from the community regarding best practices and proven appr= oaches. >>>>>> >>>>>> **Scenario:** >>>>>> - We have a few large tables (several hundred million rows) where we >>>>>> want to archive historical data (e.g., older than 1 year). >>>>>> - The archived data should be moved to a separate PostgreSQL databas= e >>>>>> (on a same or different server). >>>>>> - Our goals are: efficient data movement, minimal downtime, and safe >>>>>> deletion from the source after successful archival. >>>>>> >>>>>> - PostgreSQL version: 15.12 >>>>>> - Both source and target databases are PostgreSQL. >>>>>> >>>>>> We explored using `COPY TO` and `COPY FROM` with CSV files, uploaded >>>>>> to a SharePoint or similar storage system. However, our infrastructu= re team >>>>>> raised concerns around the computational load of large CSV processin= g and >>>>>> potential security implications with file transfers. >>>>>> >>>>>> We=E2=80=99d like to understand: >>>>>> - What approaches have worked well for you in practice? >>>>>> >>>>> >>>>> This is how I migrated 6TB of data from an Oracle database to >>>>> Postgresql, and then implemented quarterly archiving of the PG databa= se: >>>>> - COPY FROM (SELECT * FROM live_table WHERE date_fld in >>>>> some_manageable_date_range) TO STDOUT. >>>>> - Compress >>>>> - scp >>>>> - COPY TO archive_table. >>>>> - Index >>>>> - DELETE FROM live_table WHERE date_fld in some_manageable_date_range >>>>> (This I only did in the PG archive process >>>>> >>>>> (Naturally, the Oracle migration used Oracle-specific commands.) >>>>> >>>>> - Are there specific tools or strategies you=E2=80=99d recommend for = ongoing >>>>>> archival? >>>>>> >>>>> >>>>> I write generic bash loops to which you pass an array that contains >>>>> the table name, PK, date column and date range. >>>>> >>>>> Given a list of tables, it did the COPY FROM, lz4 and scp. Once that >>>>> finished successfully, another script dropped archive indices on the >>>>> current table, COPY TO and CREATE INDEX statements. A third script d= id the >>>>> deletes. >>>>> >>>>> This works even when the live database tables are all connected via >>>>> FK. You just need to carefully order the tables in your script. >>>>> >>>>> >>>>>> - Any performance or consistency issues we should watch out for? >>>>>> >>>>> >>>>> My rules for scripting are "bite-sized pieces" and "check those retur= n >>>>> codes!". >>>>> >>>>> >>>>>> Your insights or any relevant documentation/pointers would be >>>>>> immensely helpful. >>>>>> >>>>> >>>>> Index support uber alles. When deleting from a table which relies on >>>>> a foreign key link to a table which _does_ have a date field, don't >>>>> hesitate to join on that table. >>>>> >>>>> And DELETE of bite-sized chunks is faster than people give it credit >>>>> for. >>>>> >>>>> -- >>>>> Death to , and butter sauce. >>>>> Don't boil me, I'm still alive. >>>>> lobster! >>>>> >>>> >>> >>> -- >>> Death to , and butter sauce. >>> Don't boil me, I'm still alive. >>> lobster! >>> >> > > -- > Death to , and butter sauce. > Don't boil me, I'm still alive. > lobster! > --0000000000002492080636964816 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Everyone,
Thank you all for the helpful suggestions, insight= s, and follow-up questions. I truly appreciate the time and effort you=E2= =80=99ve taken to share your experiences and recommendations.

To answer one of the common questions= : **yes, we are using partitioned tables**, primarily based on a timestamp = column. This setup is already helping us manage and isolate historical data= more effectively.

The i= nput from this community has been incredibly valuable in helping us shape o= ur archival approach. We=E2=80=99re currently evaluating a few options base= d on your feedback and will proceed with a solution that best balances effi= ciency, reliability, and security.

We may reach out again with more specific questions or for furth= er suggestions once we finalize the approach and start implementation.

Thanks again for your suppor= t!

Best regards,=C2=A0= =C2=A0
Ramzy

On Sat, May = 31, 2025, 01:01 Ron Johnson <= ronljohnsonjr@gmail.com> wrote:
That's an unanswerable question, as I would not us= e Windows.=C2=A0=C2=A0=F0=9F=98=81

Seriously though, sin= ce it's an image-heavy database full of PDF and TIFF files, I'd do = what I did on Linux when needing to migrate/upgrade a 6TB (including indice= s) db from PG 9.6 to PG 14, and took four hours:
pg_dump -Z1 --jo= bs=3D16


On Fri, May 30, 2025 at 2:39=E2=80=AFPM Andy = Hartman <hartman60home@gmail.com> wrote:
What would you use= for backup if PG hosted on Windows

On Fri, May 30, 2025 at 2:10=E2=80=AFPM = Ron Johnson <ronljohnsonjr@gmail.com> wrote:
Hmm... th= at was a few years ago, back when v12 was new.=C2=A0 It took about=C2=A0 a = month (mainly because they didn't want=C2=A0me running exports during &= quot;office hours").

There were 120 INSERT &a= mp; SELECT (no UPDATE or DELETE) tables, so I was able to add indices on da= te columns, create by-month views.=C2=A0 (We migrated the dozen or so re= latively small UPDATE tables on cut-over day.=C2=A0 On that same day, I= migrated the current month and the previous month's data in those 120 = tables.

I made separate cron jobs to:
- = export views from Oracle into COPY-style tab-separated flat files,=C2=A0
- lz4-compress views that had finished exporting, and
- s= cp files that were finished compressing, to an AWS EC2 VM.

These jobs pipelined, so there was always a job exporting, always = a job ready to compress tsv files, and another job ready to scp the lz4 fil= es.=C2=A0 When there was nothing for a step to do, the job would sleep for = a couple of minutes, then check if there was more work to do.
On the AWS EC2 VM, a different cron job waited for files to fin= ish transferring, then loaded them into the correct table. Just like with t= he source host jobs, the "load" job would sleep a bit and then ch= eck for more work. I manually applied Indices.

The= AWS RDS PG12 database was about 4TB.=C2=A0 Snapshots were handled by AWS.= =C2=A0 If this had been one of my on-prem systems, I'd have used pgback= rest.=C2=A0 (pgbackrest is impressively fast: takes good advantage of PG= 9;s 1GB file max, and globs "small" files into one big file.)
On F= ri, May 30, 2025 at 12:15=E2=80=AFPM Andy Hartman <hartman60home@gm= ail.com> wrote:
what was the duration start to finish of the migrat= ion of the 6tb of data. then what do you use for a quick backup after archi= ved PG data=C2=A0

Thanks.

On Fri, May 30, 2025 at 11:29=E2=80=AFAM Ro= n Johnson <ronljohnsonjr@gmail.com> wrote:
On Fri, May 30, 2025 at 3:51=E2=80=AFAM Motog Plus <mplus7535@gmail.co= m> wrote:
Hi Team,

We are currently planning a data archival initiative f= or our production PostgreSQL databases and would appreciate suggestions or = insights from the community regarding best practices and proven approaches.=

**Scenario:**
- We have a few large tables (several hundred million rows) w= here we want to archive historical data (e.g., older than 1 year).
- The archived data should be moved to a separate PostgreSQL= database (on a same or different server).
- Our goa= ls are: efficient data movement, minimal downtime, and safe deletion from t= he source after successful archival.

- PostgreSQL version: 15.12
- Both sour= ce and target databases are PostgreSQL.

We explored using `COPY TO` and `COPY FROM` with CSV files,= uploaded to a SharePoint or similar storage system. However, our infrastru= cture team raised concerns around the computational load of large CSV proce= ssing and potential security implications with file transfers.

We=E2=80=99d like to understand:
- What approaches have worked well for you in practice?=

This is how I migrated 6TB of = data from an Oracle database to Postgresql, and then implemented quarterly = archiving of the PG database:
- COPY FROM (SELECT * FROM live_tab= le WHERE date_fld in some_manageable_date_range) TO STDOUT.
- Com= press
- scp
- COPY TO archive_table.
- Index<= /div>
- DELETE FROM live_table WHERE date_fld in some_manageable_date_r= ange=C2=A0 (This I only did in the PG archive process
=C2=A0
(Naturally, the Oracle migration used Oracle-specific commands.)

- Are there specific tools or strategies you=E2= =80=99d recommend for ongoing archival?

I write generic bash loops to which you pass an array that contain= s the table name,=C2=A0PK,=C2=A0date column and date range.

<= /div>
Given a list of tables, it did the COPY FROM, lz4 and scp.=C2=A0 = Once that finished successfully, another script dropped=C2=A0archive indice= s on the current table, COPY TO and CREATE INDEX statements.=C2=A0 A third = script did the deletes.

This works even when the l= ive database tables are all connected via FK.=C2=A0 You just need to carefu= lly order the tables in your script.
=C2=A0
- Any= performance or consistency issues we should watch out for?

My rules=C2=A0for=C2=A0scripting are "bit= e-sized pieces" and "check those return codes!".
= =C2=A0
Your insights or any relevant documentation/pointers= would be immensely helpful.
=C2=A0
= Index support uber alles.=C2=A0 When deleting from a table which relies on = a foreign key link to a table which _does_ have a date field, don't hes= itate to join on that table.

And DELETE of bite-si= zed chunks is faster than people give it credit for.

--
Death to <Redacted>, and bu= tter sauce.
Don't boil me, I'm still alive.
<Re= dacted> lobster!


--
Death to <Redacted>, and butter sauce.Don't boil me, I'm still alive.
<Redacted> lobs= ter!


--
Death to <Redacted>, and butter sauce.Don't boil me, I'm still alive.
<Redacted> lobs= ter!
--0000000000002492080636964816--