Re: Return of the pg_wal issue..

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Saul Perdomo <[email protected]>
To: Paul Brindusa <[email protected]>
Cc: pgsql-general <[email protected]>
Subject: Re: Return of the pg_wal issue..
Date: Thu, 23 Jan 2025 09:26:51 -0500
Message-ID: <CAN3jBgFEX-fhXuNkrMYwCeWjtYK2_zSrvEmefkUZciLPHK7Psw@mail.gmail.com> (raw)
In-Reply-To: <CAFeSbqijFCW9xFOfapTzebbPcv2sWgpgrS1kVfFNJ+F7sA8R=A@mail.gmail.com>
References: <CAFeSbqh0Mj3bm9+aCaz5g4NhKn8+t4aGF=p5vOPc5oVssveATQ@mail.gmail.com>
	<[email protected]>
	<CAFeSbqijFCW9xFOfapTzebbPcv2sWgpgrS1kVfFNJ+F7sA8R=A@mail.gmail.com>

Hey Paul,

Regarding

*"I've not managed to test the queries out yet. But I am planning to test
out in my lab environment.*
*Sorry am really cautious about this as those are the main production
databases."*

As a dispassionate third-party observer, I can confirm that all SELECT and
SHOW queries from Laurenz's blog post are read-only. They're completely
safe to run in the affected environment.

On Thu, Jan 23, 2025 at 6:40 AM Paul Brindusa <[email protected]>
wrote:

> Hopefully the below is going to give a little bit more insight on the
> issue.
> I will mention as well that the cluster also replicates data to another
> mysql database if it's relevant at all.
> Also worth noting this is our production cluster and we have another
> pre-production cluster with basically the same settings and the issue there
> does not occur.
>
> A good deal more information is needed to troubleshoot this:
>
> 1) Postgres version(s).
>
> postgres (PostgreSQL) 15.10
>
> 2) The Patroni version.
>
> patroni 4.0.4
>
> 3) The Patroni configuration.
>
> scope: postgres-cluster
> name: db01
> namespace: /service/
>
> log:
>   level: INFO
>   traceback_level: ERROR
>   format: "%(asctime)s %(levelname)s: %(message)s"
>   dateformat: ""
>   max_queue_size: 1000
>   dir: /var/log/patroni
>   file_num: 4
>   file_size: 25000000
>   loggers:
>     patroni.postmaster: WARNING
>     urllib3: WARNING
>
> restapi:
>   listen: x.x.x.98:8008
>   connect_address: x.x.x.98:8008
>
> etcd3:
>   hosts: db01.local:2379,db02.local:2379,db03.local:2379
>
>
> bootstrap:
>   dcs:
>     ttl: 30
>     loop_wait: 10
>     retry_timeout: 10
>     maximum_lag_on_failover: 1048576
>     postgresql:
>       use_pg_rewind: true
>       use_slots: true
>       parameters:
>         max_connections: 500
>         superuser_reserved_connections: 5
>         password_encryption: scram-sha-256
>         max_locks_per_transaction: 512
>         max_prepared_transactions: 0
>         huge_pages: try
>         shared_buffers: 128MB
>         effective_cache_size: 4GB
>         work_mem: 128MB
>         maintenance_work_mem: 256MB
>         checkpoint_timeout: 15min
>         checkpoint_completion_target: 0.9
>         min_wal_size: 80MB
>         max_wal_size: 1GB
>         wal_buffers: 32MB
>         default_statistics_target: 1000
>         seq_page_cost: 1
>         random_page_cost: 4
>         effective_io_concurrency: 2
>         synchronous_commit: on
>         autovacuum: on
>         autovacuum_max_workers: 5
>         autovacuum_vacuum_scale_factor: 0.01
>         autovacuum_analyze_scale_factor: 0.01
>         autovacuum_vacuum_cost_limit: 500
>         autovacuum_vacuum_cost_delay: 2
>         autovacuum_naptime: 1s
>         max_files_per_process: 4096
>         archive_mode: on
>         archive_timeout: 1800s
>         archive_command: cd .
>         wal_level: replica
>         wal_keep_size: 2GB
>         max_wal_senders: 10
>         max_replication_slots: 10
>         hot_standby: on
>         wal_log_hints: on
>         wal_compression: on
>         shared_preload_libraries: pgaudit
>         track_io_timing: on
>         log_lock_waits: on
>         log_temp_files: 0
>         track_activities: on
>         track_counts: on
>         track_functions: all
>         log_checkpoints: on
>         logging_collector: on
>         log_truncate_on_rotation: on
>         log_rotation_age: 1d
>         log_rotation_size: 1GB
>         log_line_prefix: '%m [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h '
>         log_filename: postgresql-%Y-%m-%d.log
>         log_directory: /var/log/pgsql
>         log_connections: on
>         log_disconnections: on
>         log_statement: ddl
>         log_error_verbosity: verbose
>         hot_standby_feedback: on
>         max_standby_streaming_delay: 30s
>         wal_receiver_status_interval: 10s
>         idle_in_transaction_session_timeout: 10min
>         jit: off
>         max_worker_processes: 24
>         max_parallel_workers: 8
>         max_parallel_workers_per_gather: 2
>         max_parallel_maintenance_workers: 2
>
>   initdb:
>   - encoding: UTF8
>   - data-checksums
>
>   pg_hba:
>   - host replication replicator 127.0.0.1/32 md5
>
>   - host replication replicator x.x.x.98/27 scram-sha-256
>
>
>
>   - host replication replicator x.x.x.99/27 scram-sha-256
>
>
>
>   - host replication replicator x.x.x.100/27 scram-sha-256
>
>
>   - host all all 0.0.0.0/0 md5
>
> postgresql:
>   listen: x.x.x.98:5432
>   connect_address: x.x.x.98:5432
>   data_dir: /var/lib/pgsql/data
>   bin_dir: /usr/bin
>   pgpass: /var/lib/pgsql/.pgpass_patroni
>   authentication:
>     replication:
>       username: replicator
>       password: password
>     superuser:
>       username: postgres
>       password: password
>   parameters:
>     unix_socket_directories: /var/run/postgresql
>
>   remove_data_directory_on_rewind_failure: false
>   remove_data_directory_on_diverged_timelines: false
>
>   create_replica_methods:
>     - basebackup
>   basebackup:
>     max-rate: '100M'
>     checkpoint: 'fast'
>
> watchdog:
>   mode: required
>   device: /dev/watchdog
>   safety_margin: 5
>
> tags:
>   nofailover: false
>   noloadbalance: false
>   clonefrom: false
>   nosync: false
>
> 4) Definition of 'ridiculous rate'.
>
> 1GB / day
>
> 5) Relevant information from the logs.
>
> Below entry is something taken off today's log  until this point in time
> which I think it might be relevant. I cannot see any specifics. If there is
> anything else please let me know.
>
> 2<REDACTED>:<REDACTED> GMT [186889]: [863-1] db=,user=,app=,client= LOG:
>  00000: checkpoint starting: time
> 2<REDACTED>:<REDACTED> GMT [186889]: [864-1] db=,user=,app=,client=
> LOCATION:  LogCheckpointStart, xlog.c:6121
> 2<REDACTED>:<REDACTED> GMT [186889]: [865-1] db=,user=,app=,client= LOG:
>  00000: checkpoint complete: wrote 66 buffers (0.4%); 0 WAL file(s) added,
> 0 removed, 0 recycled; write=6.563 s, sync=0.003 s, total=6.619 s; sync
> files=22, longest=0.002 s, average=0.001 s; distance=776 kB, estimate=56426
> kB
> 2<REDACTED>:<REDACTED> GMT [186889]: [866-1] db=,user=,app=,client=
> LOCATION:  LogCheckpointEnd, xlog.c:6202
> 2<REDACTED>:<REDACTED> GMT [2439188]: [7-1]
> db=documentation-database,user=documentation-database-user,app=PostgreSQL
> JDBC Driver,client=<REDACTED> LOG:  00000: disconnection: session time:
> 0:<REDACTED> user=documentation-database-user
> database=documentation-database host=<REDACTED> port=56170
>
>
> @Laurenz
>
> I guess you are referring to
> https://www.cybertec-postgresql.com/en/why-does-my-pg_wal-keep-growing/
>
> *Yes, that is the one.*
>
> I listed all the reasons I know for your predicament.
> Did you do some research along these lines?
>
> *I've had a look at the things that you have mentioned in the guide. *
>
> If yes, what did you find?
>
> *I've not managed to test the queries out yet. But I am planning to test
> out in my lab environment.*
> *Sorry am really cautious about this as those are the main production
> databases.*
>
> *Hope the above is going to give a bit of insight on the root cause of the
> problem.*
>
>
>
> Yours,
> Laurenz Albe
>
>
>
> On Wed, Jan 22, 2025 at 6:03 PM Adrian Klaver <[email protected]>
> wrote:
>
>> On 1/22/25 09:33, Paul Brindusa wrote:
>> > Good afternoon,
>> >
>> > Following below we are facing a similar issue and im getting a real
>> buzz
>> > to get this working myself, speaking to my DBA  in the  company has
>> > actually left me a bit cold as he is not good with postgres.
>> >
>> > So I want to try and get a solution for this and fix this issue with
>> the
>> > pg_wal files filling up the drive at a ridiculous rate. I have been
>> > manually moving logs to a different directory but have had no luck in
>> > finding an actual solution.
>> >
>> > The cluster is a 3 node cluster with HA which is running wirth patroni.
>> >
>> > Please help me out, I will mention that I have test cluster spun up in
>> > case something needs testing.
>> >
>> > Also want to give a shout out to Lorenz Albe's for posting stuff about
>> > wal files on his company blog.
>> >
>> > Again any help will be greatly appreciated.
>>
>> A good deal more information is needed to troubleshoot this:
>>
>> 1) Postgres version(s).
>>
>> 2) The Patroni version.
>>
>> 3) The Patroni configuration.
>>
>> 4) Definition of 'ridiculous rate'.
>>
>> 5) Relevant information from the logs.
>>
>> >
>> >
>> > " On one of our postgres instances we have the pg_wal/data folder up to
>> > 196GB, out of 200GB disk filled up.
>> > This has stopped the posgresql.service this morning causing two
>> > applications to crash.
>> > Unfortunately our database admin is on leave today, and we are trying
>> to
>> > figure out how to get the disk down?
>> > Any ideas or suggestions are more than welcome.
>> >
>> > Thank you in advance."
>> >
>> >
>> > --
>> > Kind Regards,
>> > Paul Brindusa
>> > [email protected] <mailto:[email protected]>
>> >
>>
>> --
>> Adrian Klaver
>> [email protected]
>>
>>
>
> --
> Kind Regards,
> Paul Brindusa
> [email protected]
>
>

view thread (5+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Return of the pg_wal issue..
  In-Reply-To: <CAN3jBgFEX-fhXuNkrMYwCeWjtYK2_zSrvEmefkUZciLPHK7Psw@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox