Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication)

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication)
6+ messages / 2 participants
[nested] [flat]

* Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication)
@ 2025-11-05 06:27  Bala M <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Bala M @ 2025-11-05 06:27 UTC (permalink / raw)
  To: Greg Sabino Mullane <[email protected]>; Francisco Olarte <[email protected]>; +Cc: [email protected] <[email protected]>; [email protected]; [email protected]

Thank you all for your suggestions,

Thanks for your quick response and for sharing the details.
After reviewing the options, the logical replication approach seems to be
the most feasible one with minimal downtime.

However, we currently have 7 streaming replication setups running from
production, with a total database size of around 15 TB. Out of this, there
are about 10 large tables ranging from 1 TB (max) to 50 GB (min) each,
along with approximately 150+ sequences.

Could you please confirm if there are any successful case studies or
benchmarks available for a similar setup?
Additionally, please share any recommended parameter tuning or best
practices for handling logical replication at this scale.

Current server configuration:

CPU: 144 cores

RAM: 512 GB

Thanks & Regards
Krishna.

On Fri, 24 Oct 2025 at 21:55, Francisco Olarte <[email protected]>
wrote:

>
> On Thu, 23 Oct 2025 at 17:21, Greg Sabino Mullane <[email protected]>
> wrote
>
> pg_dump is the most reliable, and the slowest. Keep in mind that only the
>> actual data needs to move over (not the indexes, which get rebuilt after
>> the data is loaded). You could also mix-n-match pg_logical and pg_dump if
>> you have a few tables that are super large. Whether either approach fits in
>> your 24 hour window is hard to say without you running some tests.
>>
>
> Long time ago I had a similar problem and did a "running with scissors"
> restore. This means:
>
> 1.- Prepare normal configuration, test, etc for the new version.
> 2.- Prepare a restore configuration, with fsync=off, wallevel=minimal,
> whatever option gives you any speed advantage.
>
> As the target was empty, if restore failed we could just clean and restart.
>
> 3.- Dump, boot with the restore configuration, restore, clean shutdown,
> switch to production configuration, boot again and follow on.
>
> Time has passed and I lost my notes, but I remember the restore was much
> faster than doing it with the normal production configuration. Given
> current machine speeds, it maybe doable.
>
>
> Francisco Olarte.
>
>

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication)
@ 2025-11-05 15:36  Adrian Klaver <[email protected]>
  parent: Bala M <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Adrian Klaver @ 2025-11-05 15:36 UTC (permalink / raw)
  To: Bala M <[email protected]>; Greg Sabino Mullane <[email protected]>; Francisco Olarte <[email protected]>; +Cc: [email protected]; [email protected]

On 11/4/25 22:27, Bala M wrote:
> Thank you all for your suggestions,
> 
> Thanks for your quick response and for sharing the details.
> After reviewing the options, the logical replication approach seems to 
> be the most feasible one with minimal downtime.
> 
> However, we currently have 7 streaming replication setups running from 
> production, with a total database size of around 15 TB. Out of this, 
> there are about 10 large tables ranging from 1 TB (max) to 50 GB (min) 
> each, along with approximately 150+ sequences.
> 
> Could you please confirm if there are any successful case studies or 
> benchmarks available for a similar setup?

Since you have given minimal information in this post, I doubt there is 
really a way to compare to other situations. Collect the details you 
provided earlier in the thread for those folks getting to it just now.

That would be:

1) Postgres versions on both ends

2) OS and versions on both ends.

3) Network distance between 'machines'.

4) The logical replication settings.

> Additionally, please share any recommended parameter tuning or best 
> practices for handling logical replication at this scale.

Since you have already started is that not already to late for this?



> 
> Current server configuration:
> 
> CPU: 144 cores
> 
> RAM: 512 GB
> 
> 
> Thanks & Regards
> Krishna.
> 



-- 
Adrian Klaver
[email protected]






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication)
@ 2025-11-06 17:04  Bala M <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Bala M @ 2025-11-06 17:04 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: Greg Sabino Mullane <[email protected]>; Francisco Olarte <[email protected]>; [email protected]; [email protected]

Hi  Adrian, Thank you for your response. Please find the requested details
below:

*PostgreSQL Version:*

Source: PostgreSQL 11.15

Target: PostgreSQL 16.9

*Operating System:*

Source: RHEL 7.9

Target: RHEL 9.6

*Network Distance:*

Both servers are in the same data center, connected through a high-speed
internal network (low latency).

Logical Replication Settings:

*Source - Postgres 11.15.*

-- ==== WAL & Replication Settings ====

 wal_level = 'logical'
 max_wal_senders = '30'
 max_replication_slots = '20'
 wal_keep_segments = '800'
 wal_sender_timeout = '300s'
 max_worker_processes = '32'
 max_logical_replication_workers = '16'
 max_sync_workers_per_subscription = '8'

 ==== WAL & Checkpoint  ====

 max_wal_size = '40GB'
 min_wal_size = '4GB'
 checkpoint_timeout = '45min'
 checkpoint_completion_target = '0.9'

 ====  Memory ====
 shared_buffers = '18GB'
 work_mem = '128MB'
 maintenance_work_mem = 4GB'
 effective_cache_size = '275GB'


*Target DB Postgres 16.10*

 ==== Logical Replication Settings  ====
 max_worker_processes = '32'
 max_logical_replication_workers = '16'
 max_sync_workers_per_subscription = '8'
 wal_receiver_timeout = '300s'

 ==== WAL & Checkpoint  ====

 checkpoint_timeout = '45min'
 checkpoint_completion_target = '0.9'
 max_wal_size = '40GB'
 min_wal_size = '4GB'

 ====  Memory  ====
 shared_buffers = '18GB'
 work_mem = '128MB'
 maintenance_work_mem = '4GB'
 effective_cache_size = '275GB'
 synchronous_commit = 'off'


Since you have already started is that not already to late for this?

Yes We are currently in the *testing phase* and validating with the above
parameters. However, the replication process has been *extremely slow —
it’s been running for the past 5 days* with limited progress.

Any specific tuning recommendations or best practices to improve
performance at this stage would be greatly appreciated.

Thanks & Regards
Krishna.


On Wed, 5 Nov 2025 at 21:07, Adrian Klaver <[email protected]>
wrote:

> On 11/4/25 22:27, Bala M wrote:
> > Thank you all for your suggestions,
> >
> > Thanks for your quick response and for sharing the details.
> > After reviewing the options, the logical replication approach seems to
> > be the most feasible one with minimal downtime.
> >
> > However, we currently have 7 streaming replication setups running from
> > production, with a total database size of around 15 TB. Out of this,
> > there are about 10 large tables ranging from 1 TB (max) to 50 GB (min)
> > each, along with approximately 150+ sequences.
> >
> > Could you please confirm if there are any successful case studies or
> > benchmarks available for a similar setup?
>
> Since you have given minimal information in this post, I doubt there is
> really a way to compare to other situations. Collect the details you
> provided earlier in the thread for those folks getting to it just now.
>
> That would be:
>
> 1) Postgres versions on both ends
>
> 2) OS and versions on both ends.
>
> 3) Network distance between 'machines'.
>
> 4) The logical replication settings.
>
> > Additionally, please share any recommended parameter tuning or best
> > practices for handling logical replication at this scale.
>
> Since you have already started is that not already to late for this?
>
>
>
> >
> > Current server configuration:
> >
> > CPU: 144 cores
> >
> > RAM: 512 GB
> >
> >
> > Thanks & Regards
> > Krishna.
> >
>
>
>
> --
> Adrian Klaver
> [email protected]
>


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication)
@ 2025-11-06 18:23  Adrian Klaver <[email protected]>
  parent: Bala M <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Adrian Klaver @ 2025-11-06 18:23 UTC (permalink / raw)
  To: Bala M <[email protected]>; +Cc: Greg Sabino Mullane <[email protected]>; Francisco Olarte <[email protected]>; [email protected]; [email protected]

On 11/6/25 09:04, Bala M wrote:
> 
> Hi  Adrian, Thank you for your response. Please find the requested 
> details below:
> 

>     Since you have already started is that not already to late for this?
> 
> Yes We are currently in the *testing phase* and validating with the 
> above parameters. However, the replication process has been *extremely 
> slow — it’s been running for the past 5 days* with limited progress.

1) Show how you have set up the publishers and subscribers.

2) Are you trying to replicate everything at once or a table or subset 
of tables at a time?

3) How active is the source database?

> 
> Any specific tuning recommendations or best practices to improve 
> performance at this stage would be greatly appreciated.
> 
> 
> Thanks & Regards
> Krishna.
> 
> 

-- 
Adrian Klaver
[email protected]






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication)
@ 2025-11-07 00:30  Bala M <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Bala M @ 2025-11-07 00:30 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: Greg Sabino Mullane <[email protected]>; Francisco Olarte <[email protected]>; [email protected]; [email protected]

*1) Show how you have set up the publishers and subscribers.*

We have configured logical replication between PostgreSQL 11 (publisher)
and PostgreSQL 16 (subscriber).

Publisher: PostgreSQL 11 with wal_level=logical, max_replication_slots=20,
and publication created for all tables (FOR ALL TABLES).

Subscriber: PostgreSQL 16 with max_logical_replication_workers=16,
max_sync_workers_per_subscription=8, connected through a subscription to
replicate all tables.

 select * from pg_publication;
  pubname  | pubowner | puballtables | pubinsert | pubupdate | pubdelete |
pubtruncate
-----------+----------+--------------+-----------+-----------+-----------+-------------
 test_v_pub |       10 | t            | t         | t         | t         |
t
(1 row)

  select * from pg_replication_slots ;
                 slot_name                 |  plugin  | slot_type | datoid
| database | temporary | active | active_pid | xmin | catalog_xmin |
 restart_lsn   | confirmed_flush_lsn
-------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+----------------+---------------------
 vahan_sub                                 | pgoutput | logical   |  16468
| test_v     | f         | t      |    3722767 |      |   3146687395 |
20538/7D039948 | 20538/7D045790
 pg_120870_sync_94301_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/71C4FD28 | 20538/71C4FD60
 pg_120870_sync_93499_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7D039948 | 20538/7D039980
 pg_120870_sync_99121_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/6F290C88 | 20538/6F290CC0
 pg_120870_sync_108854_7567666078127625550 | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7C6B5FD0 | 20538/7C6B6020
 pg_120870_sync_93451_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7D0359A8 | 20538/7D0359E0
 pg_120870_sync_94589_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/4C0001E8 | 20538/4C000220
 pg_120870_sync_98225_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/6F290DA0 | 20538/6F290DD8
 pg_120870_sync_96438_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/75CE4120 | 20538/75CE4158
 pg_120870_sync_93469_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7D00E0B0 | 20538/7D00E0E8
 pg_120870_sync_94277_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/4C000178 | 20538/4C0001B0
 pg_120870_sync_109085_7567666078127625550 | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7B72CFE8 | 20538/7B72D020
 pg_120870_sync_109110_7567666078127625550 | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7AE3B078 | 20538/7AE3B0B0
 pg_120870_sync_99247_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/4C0001B0 | 20538/4C0001E8
 pg_120870_sync_103454_7567666078127625550 | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7D0002F0 | 20538/7D000328
 pg_120870_sync_96317_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/71C59D20 | 20538/71C59D58
 pg_120870_sync_107790_7567666078127625550 | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7A4102F0 | 20538/7A410328
 pg_120870_sync_98730_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/6F00CE40 | 20538/6F00CE78
 pg_120870_sync_93488_7567666078127625550  | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/7CF2C4F0 | 20538/7CF2C528
 pg_120870_sync_103457_7567666078127625550 | pgoutput | logical   |  16468
| test_v     | f         | f      |            |      |   3146687395 |
20538/6F127188 | 20538/6F1271C0
(20 rows)

postgres=# select * from pg_subscription;
-[ RECORD 1
]-------+---------------------------------------------------------------------
oid                 | 120870
subdbid             | 91683
subskiplsn          | 0/0
subname             | test_v_sub
subowner            | 10
subenabled          | t
subbinary           | f
substream           | f
subtwophasestate    | d
subdisableonerr     | f
subpasswordrequired | t
subrunasowner       | f
subconninfo         | host=xyz port=5432 user=postgres password=test
dbname=test_v
subslotname         | test_v_sub
subsynccommit       | off
subpublications     | {test_v_pub}
suborigin           | any


*2) Are you trying to replicate everything at once or a table or subsetof
tables at a time?*

We are replicating all tables at once (full database-level replication),
not table-by-table.

*3) How active is the source database?*
The source database is highly active, with continuous transactions and
heavy write activity across large tables.


Thanks & Regards
Krishna.


On Thu, 6 Nov 2025 at 23:53, Adrian Klaver <[email protected]>
wrote:

> On 11/6/25 09:04, Bala M wrote:
> >
> > Hi  Adrian, Thank you for your response. Please find the requested
> > details below:
> >
>
> >     Since you have already started is that not already to late for this?
> >
> > Yes We are currently in the *testing phase* and validating with the
> > above parameters. However, the replication process has been *extremely
> > slow — it’s been running for the past 5 days* with limited progress.
>
> 1) Show how you have set up the publishers and subscribers.
>
> 2) Are you trying to replicate everything at once or a table or subset
> of tables at a time?
>
> 3) How active is the source database?
>
> >
> > Any specific tuning recommendations or best practices to improve
> > performance at this stage would be greatly appreciated.
> >
> >
> > Thanks & Regards
> > Krishna.
> >
> >
>
> --
> Adrian Klaver
> [email protected]
>


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication)
@ 2025-11-08 16:34  Adrian Klaver <[email protected]>
  parent: Bala M <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: Adrian Klaver @ 2025-11-08 16:34 UTC (permalink / raw)
  To: Bala M <[email protected]>; +Cc: Greg Sabino Mullane <[email protected]>; Francisco Olarte <[email protected]>; [email protected]; [email protected]

On 11/6/25 16:30, Bala M wrote:
> *
> *
> *1) Show how you have set up the publishers and subscribers.*
> 
> We have configured logical replication between PostgreSQL 11 (publisher) 
> and PostgreSQL 16 (subscriber).
> 
> Publisher: PostgreSQL 11 with wal_level=logical, 
> max_replication_slots=20, and publication created for all tables (FOR 
> ALL TABLES).
> 

I have not worked with a database of this size, still I would think FOR
ALL TABLES is not the way to go. That it along with the existing 
activity on the publisher would result in a slow transfer of data.

-- 
Adrian Klaver
[email protected]






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2025-11-08 16:34 UTC | newest]

Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-11-05 06:27 Re: Index corruption issue after migration from RHEL 7 to RHEL 9 (PostgreSQL 11 streaming replication) Bala M <[email protected]>
2025-11-05 15:36 ` Adrian Klaver <[email protected]>
2025-11-06 17:04   ` Bala M <[email protected]>
2025-11-06 18:23     ` Adrian Klaver <[email protected]>
2025-11-07 00:30       ` Bala M <[email protected]>
2025-11-08 16:34         ` Adrian Klaver <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox