Re: Return of the pg_wal issue..

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Paul Brindusa <[email protected]>
To: Saul Perdomo <[email protected]>
Cc: pgsql-general <[email protected]>
Subject: Re: Return of the pg_wal issue..
Date: Thu, 23 Jan 2025 15:56:48 +0000
Message-ID: <CAFeSbqhZx-LxgDH48v-LqJd5QGJoQSpRFuGrW92_bTCT+Qv9dA@mail.gmail.com> (raw)
In-Reply-To: <CAN3jBgG8W+hiQqUGtJaVkE3wfTmmLO_XqQt=PPvGQbHMcgmndg@mail.gmail.com>
References: <CAFeSbqh0Mj3bm9+aCaz5g4NhKn8+t4aGF=p5vOPc5oVssveATQ@mail.gmail.com>
	<[email protected]>
	<CAFeSbqijFCW9xFOfapTzebbPcv2sWgpgrS1kVfFNJ+F7sA8R=A@mail.gmail.com>
	<CAN3jBgFEX-fhXuNkrMYwCeWjtYK2_zSrvEmefkUZciLPHK7Psw@mail.gmail.com>
	<CAFeSbqhs_5M-oz136PB_a1RgNQh5PBGSj-k_zL8hsmfV9ZTptw@mail.gmail.com>
	<CAN3jBgG8W+hiQqUGtJaVkE3wfTmmLO_XqQt=PPvGQbHMcgmndg@mail.gmail.com>

Wow, hats off to you kind sir!
Will defo look into this. Thank you for your support with this.

On Thu, Jan 23, 2025 at 2:51 PM Saul Perdomo <[email protected]> wrote:

> In a nutshell: Remember the ticker-tape of old? (You know, the festive
> paper strips tossed out of high-rises by stockbrokers, that float on down
> on city parades after a world war is won in black and white movies?) Those
> were market-transaction serial records.. Similar to those guys, the
> write-ahead logfiles are records of every single DB action that isn't
> read-only.
>
> Now, one could think "well if I've already committed the transactions to
> the database, why do I need to keep the receipts?" And, well, strictly
> speaking you don't -- but you really want to, because they could serve to
> replay, *in perfect order, *the database activity for any given period.
> This data is *gold *when it comes to, say, recovering from disaster.
>
> This is why everybody will tell you "don't just delete these files,
> archive them properly!" Again, for operational purposes, you could just
> delete them. But you really want to make a *copy *of them before you
> do... you know, *just in case *something bad happens to your DB that
> makes you want to roll it back in time.
>
> Enter the "archive_command" function. Instead of simply deleting them, you
> tell PG to pass the files on to another piece of software that knows what
> to do with this highly valuable data. To avoid complicating matters
> further, at this point I'll simply recommend that you use PGBackRest for
> this, it's my favorite piece of software beside the PG server.
>
> Or, you know, you could just delete them. You really don't want to,
> though. Get PGBackRest going and the concepts will click for you as you
> progress along setting it up: https://pgbackrest.org/user-guide.html
>
> Cheers
> Saul
>
> On Thu, Jan 23, 2025 at 9:40 AM Paul Brindusa <[email protected]>
> wrote:
>
>> Hi Saul,
>>
>> Fantastic, highly appreciate that.
>> Initially i've taken  the top bit of Laurenz's post to understand what
>> these wal files are.
>> Any help is more than welcome, apologies for not checking that bit
>> earlier. In the meantime i've checked those queries as well.
>>
>> Thank you
>>
>>
>> On Thu, Jan 23, 2025 at 2:27 PM Saul Perdomo <[email protected]>
>> wrote:
>>
>>> Hey Paul,
>>>
>>> Regarding
>>>
>>> *"I've not managed to test the queries out yet. But I am planning to
>>> test out in my lab environment.*
>>> *Sorry am really cautious about this as those are the main production
>>> databases."*
>>>
>>> As a dispassionate third-party observer, I can confirm that all SELECT
>>> and SHOW queries from Laurenz's blog post are read-only. They're completely
>>> safe to run in the affected environment.
>>>
>>> On Thu, Jan 23, 2025 at 6:40 AM Paul Brindusa <[email protected]>
>>> wrote:
>>>
>>>> Hopefully the below is going to give a little bit more insight on the
>>>> issue.
>>>> I will mention as well that the cluster also replicates data to another
>>>> mysql database if it's relevant at all.
>>>> Also worth noting this is our production cluster and we have another
>>>> pre-production cluster with basically the same settings and the issue there
>>>> does not occur.
>>>>
>>>> A good deal more information is needed to troubleshoot this:
>>>>
>>>> 1) Postgres version(s).
>>>>
>>>> postgres (PostgreSQL) 15.10
>>>>
>>>> 2) The Patroni version.
>>>>
>>>> patroni 4.0.4
>>>>
>>>> 3) The Patroni configuration.
>>>>
>>>> scope: postgres-cluster
>>>> name: db01
>>>> namespace: /service/
>>>>
>>>> log:
>>>>   level: INFO
>>>>   traceback_level: ERROR
>>>>   format: "%(asctime)s %(levelname)s: %(message)s"
>>>>   dateformat: ""
>>>>   max_queue_size: 1000
>>>>   dir: /var/log/patroni
>>>>   file_num: 4
>>>>   file_size: 25000000
>>>>   loggers:
>>>>     patroni.postmaster: WARNING
>>>>     urllib3: WARNING
>>>>
>>>> restapi:
>>>>   listen: x.x.x.98:8008
>>>>   connect_address: x.x.x.98:8008
>>>>
>>>> etcd3:
>>>>   hosts: db01.local:2379,db02.local:2379,db03.local:2379
>>>>
>>>>
>>>> bootstrap:
>>>>   dcs:
>>>>     ttl: 30
>>>>     loop_wait: 10
>>>>     retry_timeout: 10
>>>>     maximum_lag_on_failover: 1048576
>>>>     postgresql:
>>>>       use_pg_rewind: true
>>>>       use_slots: true
>>>>       parameters:
>>>>         max_connections: 500
>>>>         superuser_reserved_connections: 5
>>>>         password_encryption: scram-sha-256
>>>>         max_locks_per_transaction: 512
>>>>         max_prepared_transactions: 0
>>>>         huge_pages: try
>>>>         shared_buffers: 128MB
>>>>         effective_cache_size: 4GB
>>>>         work_mem: 128MB
>>>>         maintenance_work_mem: 256MB
>>>>         checkpoint_timeout: 15min
>>>>         checkpoint_completion_target: 0.9
>>>>         min_wal_size: 80MB
>>>>         max_wal_size: 1GB
>>>>         wal_buffers: 32MB
>>>>         default_statistics_target: 1000
>>>>         seq_page_cost: 1
>>>>         random_page_cost: 4
>>>>         effective_io_concurrency: 2
>>>>         synchronous_commit: on
>>>>         autovacuum: on
>>>>         autovacuum_max_workers: 5
>>>>         autovacuum_vacuum_scale_factor: 0.01
>>>>         autovacuum_analyze_scale_factor: 0.01
>>>>         autovacuum_vacuum_cost_limit: 500
>>>>         autovacuum_vacuum_cost_delay: 2
>>>>         autovacuum_naptime: 1s
>>>>         max_files_per_process: 4096
>>>>         archive_mode: on
>>>>         archive_timeout: 1800s
>>>>         archive_command: cd .
>>>>         wal_level: replica
>>>>         wal_keep_size: 2GB
>>>>         max_wal_senders: 10
>>>>         max_replication_slots: 10
>>>>         hot_standby: on
>>>>         wal_log_hints: on
>>>>         wal_compression: on
>>>>         shared_preload_libraries: pgaudit
>>>>         track_io_timing: on
>>>>         log_lock_waits: on
>>>>         log_temp_files: 0
>>>>         track_activities: on
>>>>         track_counts: on
>>>>         track_functions: all
>>>>         log_checkpoints: on
>>>>         logging_collector: on
>>>>         log_truncate_on_rotation: on
>>>>         log_rotation_age: 1d
>>>>         log_rotation_size: 1GB
>>>>         log_line_prefix: '%m [%p]: [%l-1]
>>>> db=%d,user=%u,app=%a,client=%h '
>>>>         log_filename: postgresql-%Y-%m-%d.log
>>>>         log_directory: /var/log/pgsql
>>>>         log_connections: on
>>>>         log_disconnections: on
>>>>         log_statement: ddl
>>>>         log_error_verbosity: verbose
>>>>         hot_standby_feedback: on
>>>>         max_standby_streaming_delay: 30s
>>>>         wal_receiver_status_interval: 10s
>>>>         idle_in_transaction_session_timeout: 10min
>>>>         jit: off
>>>>         max_worker_processes: 24
>>>>         max_parallel_workers: 8
>>>>         max_parallel_workers_per_gather: 2
>>>>         max_parallel_maintenance_workers: 2
>>>>
>>>>   initdb:
>>>>   - encoding: UTF8
>>>>   - data-checksums
>>>>
>>>>   pg_hba:
>>>>   - host replication replicator 127.0.0.1/32 md5
>>>>
>>>>   - host replication replicator x.x.x.98/27 scram-sha-256
>>>>
>>>>
>>>>
>>>>   - host replication replicator x.x.x.99/27 scram-sha-256
>>>>
>>>>
>>>>
>>>>   - host replication replicator x.x.x.100/27 scram-sha-256
>>>>
>>>>
>>>>   - host all all 0.0.0.0/0 md5
>>>>
>>>> postgresql:
>>>>   listen: x.x.x.98:5432
>>>>   connect_address: x.x.x.98:5432
>>>>   data_dir: /var/lib/pgsql/data
>>>>   bin_dir: /usr/bin
>>>>   pgpass: /var/lib/pgsql/.pgpass_patroni
>>>>   authentication:
>>>>     replication:
>>>>       username: replicator
>>>>       password: password
>>>>     superuser:
>>>>       username: postgres
>>>>       password: password
>>>>   parameters:
>>>>     unix_socket_directories: /var/run/postgresql
>>>>
>>>>   remove_data_directory_on_rewind_failure: false
>>>>   remove_data_directory_on_diverged_timelines: false
>>>>
>>>>   create_replica_methods:
>>>>     - basebackup
>>>>   basebackup:
>>>>     max-rate: '100M'
>>>>     checkpoint: 'fast'
>>>>
>>>> watchdog:
>>>>   mode: required
>>>>   device: /dev/watchdog
>>>>   safety_margin: 5
>>>>
>>>> tags:
>>>>   nofailover: false
>>>>   noloadbalance: false
>>>>   clonefrom: false
>>>>   nosync: false
>>>>
>>>> 4) Definition of 'ridiculous rate'.
>>>>
>>>> 1GB / day
>>>>
>>>> 5) Relevant information from the logs.
>>>>
>>>> Below entry is something taken off today's log  until this point in
>>>> time which I think it might be relevant. I cannot see any specifics. If
>>>> there is anything else please let me know.
>>>>
>>>> 2<REDACTED>:<REDACTED> GMT [186889]: [863-1] db=,user=,app=,client=
>>>> LOG:  00000: checkpoint starting: time
>>>> 2<REDACTED>:<REDACTED> GMT [186889]: [864-1] db=,user=,app=,client=
>>>> LOCATION:  LogCheckpointStart, xlog.c:6121
>>>> 2<REDACTED>:<REDACTED> GMT [186889]: [865-1] db=,user=,app=,client=
>>>> LOG:  00000: checkpoint complete: wrote 66 buffers (0.4%); 0 WAL file(s)
>>>> added, 0 removed, 0 recycled; write=6.563 s, sync=0.003 s, total=6.619 s;
>>>> sync files=22, longest=0.002 s, average=0.001 s; distance=776 kB,
>>>> estimate=56426 kB
>>>> 2<REDACTED>:<REDACTED> GMT [186889]: [866-1] db=,user=,app=,client=
>>>> LOCATION:  LogCheckpointEnd, xlog.c:6202
>>>> 2<REDACTED>:<REDACTED> GMT [2439188]: [7-1]
>>>> db=documentation-database,user=documentation-database-user,app=PostgreSQL
>>>> JDBC Driver,client=<REDACTED> LOG:  00000: disconnection: session time:
>>>> 0:<REDACTED> user=documentation-database-user
>>>> database=documentation-database host=<REDACTED> port=56170
>>>>
>>>>
>>>> @Laurenz
>>>>
>>>> I guess you are referring to
>>>> https://www.cybertec-postgresql.com/en/why-does-my-pg_wal-keep-growing/
>>>>
>>>> *Yes, that is the one.*
>>>>
>>>> I listed all the reasons I know for your predicament.
>>>> Did you do some research along these lines?
>>>>
>>>> *I've had a look at the things that you have mentioned in the guide. *
>>>>
>>>> If yes, what did you find?
>>>>
>>>> *I've not managed to test the queries out yet. But I am planning to
>>>> test out in my lab environment.*
>>>> *Sorry am really cautious about this as those are the main production
>>>> databases.*
>>>>
>>>> *Hope the above is going to give a bit of insight on the root cause of
>>>> the problem.*
>>>>
>>>>
>>>>
>>>> Yours,
>>>> Laurenz Albe
>>>>
>>>>
>>>>
>>>> On Wed, Jan 22, 2025 at 6:03 PM Adrian Klaver <
>>>> [email protected]> wrote:
>>>>
>>>>> On 1/22/25 09:33, Paul Brindusa wrote:
>>>>> > Good afternoon,
>>>>> >
>>>>> > Following below we are facing a similar issue and im getting a real
>>>>> buzz
>>>>> > to get this working myself, speaking to my DBA  in the  company has
>>>>> > actually left me a bit cold as he is not good with postgres.
>>>>> >
>>>>> > So I want to try and get a solution for this and fix this issue with
>>>>> the
>>>>> > pg_wal files filling up the drive at a ridiculous rate. I have been
>>>>> > manually moving logs to a different directory but have had no luck
>>>>> in
>>>>> > finding an actual solution.
>>>>> >
>>>>> > The cluster is a 3 node cluster with HA which is running wirth
>>>>> patroni.
>>>>> >
>>>>> > Please help me out, I will mention that I have test cluster spun up
>>>>> in
>>>>> > case something needs testing.
>>>>> >
>>>>> > Also want to give a shout out to Lorenz Albe's for posting stuff
>>>>> about
>>>>> > wal files on his company blog.
>>>>> >
>>>>> > Again any help will be greatly appreciated.
>>>>>
>>>>> A good deal more information is needed to troubleshoot this:
>>>>>
>>>>> 1) Postgres version(s).
>>>>>
>>>>> 2) The Patroni version.
>>>>>
>>>>> 3) The Patroni configuration.
>>>>>
>>>>> 4) Definition of 'ridiculous rate'.
>>>>>
>>>>> 5) Relevant information from the logs.
>>>>>
>>>>> >
>>>>> >
>>>>> > " On one of our postgres instances we have the pg_wal/data folder up
>>>>> to
>>>>> > 196GB, out of 200GB disk filled up.
>>>>> > This has stopped the posgresql.service this morning causing two
>>>>> > applications to crash.
>>>>> > Unfortunately our database admin is on leave today, and we are
>>>>> trying to
>>>>> > figure out how to get the disk down?
>>>>> > Any ideas or suggestions are more than welcome.
>>>>> >
>>>>> > Thank you in advance."
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Kind Regards,
>>>>> > Paul Brindusa
>>>>> > [email protected] <mailto:[email protected]>
>>>>> >
>>>>>
>>>>> --
>>>>> Adrian Klaver
>>>>> [email protected]
>>>>>
>>>>>
>>>>
>>>> --
>>>> Kind Regards,
>>>> Paul Brindusa
>>>> [email protected]
>>>>
>>>>
>>
>> --
>> Kind Regards,
>> Paul Brindusa
>> [email protected]
>>
>>

-- 
Kind Regards,
Paul Brindusa
[email protected]

view thread (3+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Return of the pg_wal issue..
  In-Reply-To: <CAFeSbqhZx-LxgDH48v-LqJd5QGJoQSpRFuGrW92_bTCT+Qv9dA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox