Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1twqqq-000acv-3T for pgsql-hackers@arkaria.postgresql.org; Mon, 24 Mar 2025 23:05:10 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1twqoi-00BNjw-8d for pgsql-hackers@arkaria.postgresql.org; Mon, 24 Mar 2025 23:02:48 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1twqoh-00BNje-RN for pgsql-hackers@lists.postgresql.org; Mon, 24 Mar 2025 23:02:47 +0000 Received: from mail-vk1-xa29.google.com ([2607:f8b0:4864:20::a29]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1twqoe-000xcg-32 for pgsql-hackers@postgresql.org; Mon, 24 Mar 2025 23:02:47 +0000 Received: by mail-vk1-xa29.google.com with SMTP id 71dfb90a1353d-523e895dd3dso164588e0c.2 for ; Mon, 24 Mar 2025 16:02:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=upgrade.com; s=google; t=1742857363; x=1743462163; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=G7wJKdDnl3CGSMdNgNyMefsSeXadOnnyFBAI3u/L8TU=; b=JcKi1wJ3oo54dZ4SZS2o/frU8lS4lVKYmcAr8+ASMeFmjKDXCEfpTQqw6XP14jrf7o MUGv4rSugbcxbS1+EtLSgRGl75PTtjyG+ZcXA5JzsMc82xQ7Q4x6aHbQBkDj/aU6lZBq zHaaqRpIN/lzAbUs8DJ2o5ijoJxELBTp2+2wEgpVa+fMOp7yN0l/Fh9Szb5N9vLFEqrH K1LfnX7klIx9uDoUtVlZaBAyxe2+FEkINsEFVbuCDogYIXK6m/tLgajsmmf9KWfw0ZFI Ta9le2rolYYsoPjJQzFiiNxiqeqLhj45Ni8MhrYdI/d/4tgRfqrBHEyJ+Cnrwcs+CGBi wNZQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742857363; x=1743462163; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=G7wJKdDnl3CGSMdNgNyMefsSeXadOnnyFBAI3u/L8TU=; b=ZmjBvFsewdS4mkJqtuFtUpQy/2m+5gXzcTFL7gJByTNkK3FnE70U7VsOOjHl7flFmj Rne34Uct8mUNljAg7RZz4y/qkHBmn1OyJSzglQEd+PVWaOh9XAzz6LDvhATbmyYyOC65 RYdKFP9VDh1pzBfMKiV8UDFZCRGTywPKho6TCw8tRZP0nSKpHtmOiXsyqwzUMfAF0SgN g023pJ3hmExnoVWizQ6zGb3ad3VbZrRA2b9Xs3gHg8RXQA/T5Za+t2KonA0fufIm3fID zIiYcSpOwWIp34+D0zn+e/Qw9F1AKrUue/A3/Wb4wCfNPvPhfb0jiqaFrz4mt5UZXVX9 PQJA== X-Forwarded-Encrypted: i=1; AJvYcCWfD1h8Oontdksfn8G9Aiu3H5CGnBsyMwMUOWi4C0bKio9YEUjXHYQCUFQwAW3iA5tFBeC7ZhpytwrEZVkj@postgresql.org X-Gm-Message-State: AOJu0YyM8ez5ICXhgKAjfktQIhTSHzIrCxSO+UifDvyaUnO4szf6aYn9 PyXYK9xoNAXb4ILiPSXdLSYfVflYVknghXF2HPngsoqxtnpS79xmqIo/XvrwsILqq1NqBO3XvFf 1PxWGalZGUBH2XTK1NvbZJa6OwjmlTd6vuIlaow== X-Gm-Gg: ASbGncuHDeuX25vZPoT4hTQGtRkcsKz0c2G/eqFh7fbzodR8LFbIfxpvn86u40fLzzO 8Sx9dCgJRu7MR4nORJnQpIDqzKIDsnSfNBiHwPFsYQnzBPXMgil0IYUXPdiM3R1a0V9UzUoiF5e Gpksrm5MZML56zbRxwJUsEyBJW+bW/mNouTZhc X-Google-Smtp-Source: AGHT+IF3/9vCtoy2vfAr6HvFZgZHtqJuJqZ5OHJemjimL7RK+yAduPU/blpO5l4d7XKN/s3W5Y65J0H7dQBnnGRMDX0= X-Received: by 2002:a05:6122:2202:b0:520:4095:ad2f with SMTP id 71dfb90a1353d-525e40c3ef9mr111113e0c.1.1742857362114; Mon, 24 Mar 2025 16:02:42 -0700 (PDT) MIME-Version: 1.0 References: <86f76aa5-1ab5-4e2e-9b15-405051852a2a@postgrespro.ru> <1e81a0a1-a63b-48fb-905a-d6495f89ab73@postgrespro.ru> <0b4eefc7-4c38-4caa-b2ca-a4c75dd7dd12@postgrespro.ru> <333c2306-c401-4959-9f0c-a44c670a11a9@postgrespro.ru> <513f0188-b093-4cc8-98cf-4c324570d525@postgrespro.ru> <47a7b784-5218-43f2-96e3-65f9a729c5a5@tantorlabs.com> In-Reply-To: From: Jim Nasby Date: Mon, 24 Mar 2025 18:02:31 -0500 X-Gm-Features: AQ5f1JrTTerG0-I1G8t-g6n9ShtO0S-OXH32FJPv6vRa0RiocovqKmstEenM1q8 Message-ID: Subject: Re: Vacuum statistics To: Alena Rybakina Cc: Bertrand Drouvot , vignesh C , Ilia Evdokimov , pgsql-hackers , Alexander Korotkov , Kirill Reshke , Andrei Zubkov , Masahiko Sawada , Melanie Plageman , jian he , a.lepikhov@postgrespro.ru, Sami Imseih Content-Type: multipart/alternative; boundary="000000000000f6b7c006311e99c6" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000f6b7c006311e99c6 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Fri, Mar 21, 2025 at 2:42=E2=80=AFPM Alena Rybakina wrote: > On 13.03.2025 09:42, Bertrand Drouvot wrote: > > Hi, > > On Wed, Mar 12, 2025 at 05:15:53PM -0500, Jim Nasby wrote: > > The usecase I can see here is that we don't want autovac creating so much > WAL traffic that it starts forcing other backends to have to write WAL ou= t. > But tracking how many times autovac writes WAL buffers won't help with th= at > > Right, because the one that increments the wal_buffers_full metric could = "just" > be a victim (i.e the one that happens to trigger the WAL buffers disk flu= sh, > even though other backends contributed most of the buffer usage). > > > (though we also don't want any WAL buffers written by autovac to be count= ed > in the system-wide wal_buffers_full: > > why? Or do you mean that it would be good to have 2 kinds of metrics: one > generated by "maintenance" activity and one by "regular" backends? > > See below... > What would be helpful would be a way to determine if autovac was causing > enough traffic to force other backends to write WAL. Offhand I'm not sure > how practical that actually is though. > > a051e71e28a could help to see how much WAL has by written by the autovac = workers. > > I still don't think that helps (see below) > BTW, there's also an argument to be made that autovac should throttle > itself if we're close to running out of available WAL buffers... > > hmm, yeah I think that's an interesting idea OTOH that would mean to "del= egate" > the WAL buffers flush to another backend. > > Maybe it does, maybe it doesn't... but now I think you're getting why I'm complaining about the proposed WAL flush metrics: who *flushes* WAL tells you absolutely nothing about who generated the WAL. Not only that, but flushing WAL isn't necessarily even bad: a user backend can't COMMIT without flushing some amount of WAL (ignoring async-commit of course). That really casts the whole idea of having stats on who's flushing how much WAL in a new light: you can NOT use any such metric without a bunch of other context; including who else was flushing how much WAL, whether WAL had to absolutely be flushed anyway (ie, at bare minimum a COMMIT must flush enough WAL to cover the commit record), and even where all the WAL is coming from in the first place. Though now that I think about it... if we're reporting how much WAL is being generated by vacuum, then *maybe* it's helpful to also report how much WAL is being flushed by vacuum. My emphasis on *maybe* is because it's fine if autovac is writing more than it flushes, so long as the remainder is being flushed by the checkpointer and not user backends... but you could also determine that just by looking at how much WAL backends are flushing. Basically, I'm leaning towards it would be best to rethink the whole purpose of reporting WAL flush metrics before we further muddy the waters by adding vacuum stats about it. At minimum we should have a metric that shows how much WAL backends flushed because they *had* to due to synchronous commit settings (which does affect more than just COMMIT). > I will add it and fix the tests but later and I'll explain why. > > I'm working on this issue [0] and try have already created new statistics > in Statistics Collector to store database and relation vacuum statistics: > PGSTAT_KIND_VACUUM_DB and PGSTAT_KIND_VACUUM_RELATION. > > Vacuum statistics are saved there instead of relation's and database's > statistic structure, but for some reason it is not possible to find them = in > the hash table when building a snapshot and display them accordingly. > I have not yet figured out where the error is. > > Without solving this problem, committing vacuum statistics is not yet > possible. An alternative way for us was to refuse some statistics for now > for relations, > but we could not agree on which statistics should not be displayed yet an= d > for now we are only adding them :). > > I understand why this is important to display more vacuum information > about vacuum statistics - it will allow us to better understand the > problems of incorrect vacuum settings or, for example, notice a bug in it= s > operation. > > In order to reduce the memory consumption for storing them for those who > are not going to use them, I just realized that we need to create a > separate space for storing the statistics > I mentioned above (PGSTAT_KIND_VACUUM_DB and PGSTAT_KIND_VACUUM_RELATION)= , > there is no other way to do this and I am still trying to complete this > functionality. > > I doubt that I will have time for this by code freeze date and even if I > do, I will hardly have time for a normal review. There's really a lot mor= e > to learn related to the stat collector, so > I'm postponing it to the next commitfest. > > Sorry. I'll fix the tests as soon as I finish this part, since they'll > most likely either break the same way or in some new way. > > Tomorrow or the day after tomorrow I will send a diff patch with what I > have already managed to demonstrate the problem, since I need to bring th= e > code to a normal form. > Maybe someone who worked with the stat collector will suddenly tell me > where and what I have implemented incorrectly. > > -- > Regards, > Alena Rybakina > Postgres Professional > > --000000000000f6b7c006311e99c6 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Fri, Mar 21, 2025 at 2:42=E2=80=AFPM A= lena Rybakina <a.rybakina@p= ostgrespro.ru> wrote:
=20 =20 =20
On 13.03.2025 09:42, Bertrand Drouvot wrote:
Hi,

On Wed, Mar 12, 2025 at 05:15:53PM -0500, Jim Nasby wrote:
The usecase I can see here is that we don't want autovac c=
reating so much
WAL traffic that it starts forcing other backends to have to write WAL out.
But tracking how many times autovac writes WAL buffers won't help with =
that
Right, because the one that increments the wal_buffers_full metr=
ic could "just"
be a victim (i.e the one that happens to trigger the WAL buffers disk flush=
,
even though other backends contributed most of the buffer usage).

(though we also don't want any WAL buffers written by auto=
vac to be counted
in the system-wide wal_buffers_full:
why? Or do you mean that it would be good to have 2 kinds of met=
rics: one
generated by "maintenance" activity and one by "regular"=
; backends?
=C2=A0See below...

      
What would be helpful would be a way to determine if autovac w=
as causing
enough traffic to force other backends to write WAL. Offhand I'm not su=
re
how practical that actually is though.
a051e71e28a could help to see how much WAL has by written by the=
 autovac workers.
I still don'= ;t think that helps (see below)=C2=A0

      
BTW, there's also an argument to be made that autovac shou=
ld throttle
itself if we're close to running out of available WAL buffers...
hmm, yeah I think that's an interesting idea OTOH that would=
 mean to "delegate"
the WAL buffers flush to another backend.
Maybe it does, maybe it doesn't... but now I think you're = getting why I'm complaining about the proposed WAL flush metrics: who *= flushes* WAL tells you absolutely nothing about who generated the WAL. Not = only that, but flushing WAL isn't necessarily even bad: a user backend = can't COMMIT without flushing some amount of WAL (ignoring async-commit= of course). That really casts the whole idea of having stats on who's = flushing how much WAL in a new light: you can NOT use any such metric witho= ut a bunch of other context; including who else was flushing how much WAL, = whether WAL had to absolutely be flushed anyway (ie, at bare minimum a COMM= IT must flush enough WAL to cover the commit record), and even where all th= e WAL is coming from in the first place.

Though no= w that I think about it... if we're reporting how much WAL is being gen= erated by vacuum, then *maybe* it's helpful to also report how much WAL= is being flushed by vacuum. My emphasis on *maybe* is because it's fin= e if autovac is writing more than it flushes, so long as the remainder is b= eing flushed by the checkpointer and not user backends... but you could als= o determine that just by looking at how much WAL backends are flushing.

Basically, I'm leaning towards it would be best t= o rethink the whole purpose of reporting WAL flush metrics before we furthe= r muddy the waters by adding vacuum stats about it. At minimum we should ha= ve a metric that shows how much WAL backends flushed because they *had* to = due to synchronous commit settings (which does affect more than just COMMIT= ).

I will ad= d it and fix the tests but later and I'll explain why.=C2=A0

I'm working on this issue [0] and try have already created new statistics in Statistics Collector to store database and relation vacuum statistics: PGSTAT_KIND_VACUUM_DB and PGSTAT_KIND_VACUUM_RELATION.

Vacuum statistics are saved there instead of relation's and database's statistic structure, but for some reason it is not possible to find them in the hash table when building a snapshot and display them accordingly.
I have not yet figured out where the error is.=C2=A0

Without solving this problem, committing vacuum statistics is not yet possible. An alternative way for us was to refuse some statistics for now for relations,
but we could not agree on which statistics should not be displayed yet and for now we are only adding them :).

I understand why this is important to display more vacuum information about vacuum statistics - it will allow us to better understand the problems of incorrect vacuum settings or, for example, notice a bug in its operation.=C2=A0

In order to reduce the memory consumption for storing them for those who are not going to use them, I just realized that we need to create a separate space for storing the statistics
I mentioned above (PGSTAT_KIND_VACUUM_DB and PGSTAT_KIND_VACUUM_RELATION), there is no other way to do this and I am still trying to complete this functionality.=C2=A0

I doubt that I will have time for this by code freeze date and even if I do, I will hardly have time for a normal review. There'= s really a lot more to learn related to the stat collector, so
I'm postponing it to the next commitfest.=C2=A0

Sorry. I'll fix the tests as soon as I finish this part, since they'll most likely either break the same way or in some new way.=

Tomorrow or the day after tomorrow I will send a diff patch with what I have already managed to demonstrate the problem, since I need to bring the code to a normal form.
Maybe someone who worked with the stat collector will suddenly tell me where and what I have implemented incorrectly.

--=20
Regards,
Alena Rybakina
Postgres Professional
--000000000000f6b7c006311e99c6--