Re: relfilenode statistics 24+ messages / 4 participants [nested] [flat]
* Re: relfilenode statistics @ 2025-03-13 09:00 Kirill Reshke <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Kirill Reshke @ 2025-03-13 09:00 UTC (permalink / raw) To: Bertrand Drouvot <[email protected]>; +Cc: Robert Haas <[email protected]>; Michael Paquier <[email protected]>; [email protected] On Fri, 3 Jan 2025 at 21:18, Bertrand Drouvot <[email protected]> wrote: > As mentioned by Andres in [1], relying on the relation OID would not work to > "recover" the stats because we don't have access to the relation oid during crash > recovery. So, I'm going to resume working on the "initial" idea (i.e having the > stats keyed by relfilenode). > > [1]: https://www.postgresql.org/message-id/xvetwjsnkhx2gp6np225g2h64f4mfmg6oopkuaiivrpzd2futj%40pflk55su3... > Hmm. While it is true that catalog lookups cannot be performed during crash recovery, is it really necessary to save and retrieve statistics after a crash? Given that statistics are permitted to be outdated and server crashes are anticipated to be infrequent, it looks like losing a few analysis runs due to server crashes is acceptable. In any case, I am totally OK with the relfilenode-based method because it is generally less restricted (to other PostgreSQL parts, e.g. WAL replay) and simpler. Also, this patch needs a rebase ;) -- Best regards, Kirill Reshke ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-09-16 06:44 Michael Paquier <[email protected]> parent: Kirill Reshke <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Michael Paquier @ 2025-09-16 06:44 UTC (permalink / raw) To: Kirill Reshke <[email protected]>; +Cc: Bertrand Drouvot <[email protected]>; Robert Haas <[email protected]>; [email protected] On Thu, Mar 13, 2025 at 02:00:52PM +0500, Kirill Reshke wrote: > Hmm. While it is true that catalog lookups cannot be performed during > crash recovery, is it really necessary to save and retrieve statistics > after a crash? Yes, losing stats on crash is a *very* annoying thing. Having no stats for a relation means that autovacuum gives up entirely on relations it has no stats for, skipping them until the stats have been rebuilt, while bloat accumulates. Being able to recover these stats during crash recovery is a cheap design that would improve reliability by a large degree. > Given that statistics are permitted to be outdated and > server crashes are anticipated to be infrequent, it looks like losing > a few analysis runs due to server crashes is acceptable. > In any case, I am totally OK with the relfilenode-based method because > it is generally less restricted (to other postgresql parts e.g. wal- > replay ) and simpler. The startup process is not connected to a database and has no access to pg_class: the only thing we can know about are the on-disk files, not their in-catalog OIDs. FWIW, I think that this patch would be a huge step toward a more reliable stats system. True that the patch needs a rebase. Bertrand has also mentioned that some points needed more work. -- Michael Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-09-30 10:13 Bertrand Drouvot <[email protected]> parent: Michael Paquier <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Bertrand Drouvot @ 2025-09-30 10:13 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Tue, Sep 16, 2025 at 03:44:25PM +0900, Michael Paquier wrote: > On Thu, Mar 13, 2025 at 02:00:52PM +0500, Kirill Reshke wrote: > > Hmm. While it is true that catalog lookups cannot be performed during > > crash recovery, is it really necessary to save and retrieve statistics > > after a crash? > > Yes, losing stats on crash is a *very* annoying thing. Having no > stats for a relation means that autovacuum gives up entirely on > relations it has no stats of, skipping it entirely until they have > rebuilt and bloat would accumulate. Being able to recover these stats > from crash recovery is a cheap design, that would improve reliability > by a large degree. +1. > The startup process is not connected to a database and has no access > to pg_class: the only thing we can know about are the on-disk files, > not their in-catalog OIDs. FWIW, I think that this patch would be a > huge step forward a more reliable stats system. > > True that the patch needs a rebase. Bertrand has also mentioned that > some points needed more work. Right. I'll come back with a rebase, and a POC proposal on some stats so that we could agree on the design. Also, it looks like we have a consensus on "sometimes I don't know the relation OID so I want to use the relfilenumber instead, without changing the user experience" (see [1]). As for Michael's concern about adding a new field in the hash key: as 8 bytes are allocated for the object ID, we can go with: dboid (linked to RelFileLocator's dbOid) objoid (linked to RelFileLocator's spcOid and to the RelFileLocator's relNumber) and avoid adding a new field in the key. 
[1]: https://www.postgresql.org/message-id/CA%2BTgmoZ0u6ek_xxYJaGVBk0uEvH5txoYsCwbvxKWe-2xn_G_qg%40mail.g... Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com ^ permalink raw reply [nested|flat] 24+ messages in thread
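Editor's sketch of the key layout described above: two 4-byte values (the tablespace OID and the relfilenumber) fit side by side in the 8-byte object ID. The type and function names below are hypothetical, and the bit layout (tablespace in the high half) is an assumption; the thread does not show how the patch actually combines the two fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's 4-byte Oid / RelFileNumber. */
typedef uint32_t Oid32;

/*
 * Hypothetical packing: tablespace OID in the high 32 bits and
 * relfilenumber in the low 32 bits of a 64-bit object ID.
 */
static inline uint64_t
pack_objid(Oid32 spc_oid, Oid32 rel_number)
{
    return ((uint64_t) spc_oid << 32) | (uint64_t) rel_number;
}

/* Recover the tablespace OID from a packed object ID. */
static inline Oid32
objid_spc_oid(uint64_t objid)
{
    return (Oid32) (objid >> 32);
}

/* Recover the relfilenumber from a packed object ID. */
static inline Oid32
objid_rel_number(uint64_t objid)
{
    return (Oid32) (objid & 0xFFFFFFFFu);
}

/* Round-trip check: packing loses no information for 32-bit inputs. */
static inline void
objid_check_roundtrip(Oid32 spc_oid, Oid32 rel_number)
{
    uint64_t objid = pack_objid(spc_oid, rel_number);

    assert(objid_spc_oid(objid) == spc_oid);
    assert(objid_rel_number(objid) == rel_number);
}
```

Because both inputs are 32 bits wide, the mapping is injective: two different (tablespace, relfilenumber) pairs can never produce the same object ID, so no extra field is needed in the hash key.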
* Re: relfilenode statistics @ 2025-09-30 23:05 Michael Paquier <[email protected]> parent: Bertrand Drouvot <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Michael Paquier @ 2025-09-30 23:05 UTC (permalink / raw) To: Bertrand Drouvot <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] On Tue, Sep 30, 2025 at 10:13:57AM +0000, Bertrand Drouvot wrote: > As far Michael's concern about adding a new field in the hash key, as 8 bytes > is allocated for the object ID, then we can go with: > > dboid (linked to RelFileLocator's dbOid) > objoid (linked to RelFileLocator's spcOid and to the RelFileLocator's relNumber) > > and avoid adding a new field in the key. RelFileNumber is a 4-byte Oid, so this mapping should be able to work. Is there any reason why you would want an efficient filtering of the contents of the shared hashtable based only on a relnumber or a tablespace OID? Perhaps yes, like when a relfilenode is dropped into a bin for an efficient removal from the shared hashtable so as we don't need to do a seqscan, I just don't remember all the details of the patch and if it could act as a bottleneck in some scenarios. -- Michael Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-10-01 14:33 Bertrand Drouvot <[email protected]> parent: Michael Paquier <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Bertrand Drouvot @ 2025-10-01 14:33 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Wed, Oct 01, 2025 at 08:05:16AM +0900, Michael Paquier wrote: > On Tue, Sep 30, 2025 at 10:13:57AM +0000, Bertrand Drouvot wrote: > > As far Michael's concern about adding a new field in the hash key, as 8 bytes > > is allocated for the object ID, then we can go with: > > > > dboid (linked to RelFileLocator's dbOid) > > objoid (linked to RelFileLocator's spcOid and to the RelFileLocator's relNumber) > > > > and avoid adding a new field in the key. > > RelFileNumber is a 4-byte Oid, so this mapping should be able to work. Right. > Is there any reason why you would want an efficient filtering of the > contents of the shared hashtable based only on a relnumber or a > tablespace OID? Not that I can think of currently. > Perhaps yes, like when a relfilenode is dropped into > a bin for an efficient removal from the shared hashtable so as we > don't need to do a seqscan, I just don't remember all the details of > the patch and if it could act as a bottleneck in some scenarios. I think the first step is to replace (i.e get rid) PGSTAT_KIND_RELATION by a brand new PGSTAT_KIND_RELFILENODE and move all the existing stats that are currently under the PGSTAT_KIND_RELATION to this new PGSTAT_KIND_RELFILENODE. Let's do this by keeping the pg_stat_all_tables|indexes and pg_statio_all_tables|indexes on top of the PGSTAT_KIND_RELFILENODE and ensure that a relation rewrite keeps those stats. Once done, we could work from there to add new stats (add writes counters and ensure that some counters (n_dead_tup and friends) are replicated). Does that make sense to you? 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-10-02 01:23 Michael Paquier <[email protected]> parent: Bertrand Drouvot <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Michael Paquier @ 2025-10-02 01:23 UTC (permalink / raw) To: Bertrand Drouvot <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] On Wed, Oct 01, 2025 at 02:33:11PM +0000, Bertrand Drouvot wrote: > I think the first step is to replace (i.e get rid) PGSTAT_KIND_RELATION by a brand > new PGSTAT_KIND_RELFILENODE and move all the existing stats that are currently > under the PGSTAT_KIND_RELATION to this new PGSTAT_KIND_RELFILENODE. Likely so, yes. > Let's do this by keeping the pg_stat_all_tables|indexes and pg_statio_all_tables|indexes > on top of the PGSTAT_KIND_RELFILENODE and ensure that a relation rewrite keeps > those stats. Once done, we could work from there to add new stats (add writes > counters and ensure that some counters (n_dead_tup and friends) are replicated). Do you think it is OK to define non-transactional pending stats as being always a subset of the transactional stats? I don't quite see if there would be a case to have stats that are only flushed in a non-transactional path, while being discarded at the stats report done at transaction commit time. This means that it may be possible to structure things so as the pending non-transaction stats structure are always part of the transactional bits, and that the other way around is not possible. Perhaps that influences the design choices, at least a bit. -- Michael Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-11-07 11:28 Bertrand Drouvot <[email protected]> parent: Michael Paquier <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Bertrand Drouvot @ 2025-11-07 11:28 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Thu, Oct 02, 2025 at 10:23:11AM +0900, Michael Paquier wrote: > On Wed, Oct 01, 2025 at 02:33:11PM +0000, Bertrand Drouvot wrote: > > I think the first step is to replace (i.e get rid) PGSTAT_KIND_RELATION by a brand > > new PGSTAT_KIND_RELFILENODE and move all the existing stats that are currently > > under the PGSTAT_KIND_RELATION to this new PGSTAT_KIND_RELFILENODE. > > Likely so, yes. PFA the new implementation. It does not introduce a new PGSTAT_KIND_RELFILENODE, instead it keys the PGSTAT_KIND_RELATION by relfile locator. We may want to rename PGSTAT_KIND_RELATION to PGSTAT_KIND_RELFILENODE as a next step. The patch is structured that way: ==== 0001 Add stats tests related to rewrite While there are existing rewrite tests, the stats behavior during rewrites doesn't have a good coverage. This patch adds some tests to record some stats after different rewrite scenarios. That way, we'll be able to test that the stats are still the ones we expect after rewrites. Note that it generates a new stats_1.out (which is quite large), so we may want to move those new tests to "isolation" instead. ==== 0002 Key PGSTAT_KIND_RELATION by relfile locator This patch changes the key used for the PGSTAT_KIND_RELATION statistic kind. Instead of the relation oid, it now relies on: - dboid (linked to RelFileLocator's dbOid) - objoid which is the result of a new macro (namely RelFileLocatorToPgStatObjid()) that computes an objoid based on the RelFileLocator's spcOid and the RelFileLocator's relNumber. 
That will allow us to add new stats (add writes counters) and ensure that some counters (n_dead_tup and friends) are replicated. The patch introduces pgstat_reloid_to_relfilelocator() to 1) avoid calling RelationIdGetRelation() to get the relfilelocator based on the relation oid and 2) handle the partitioned table case. Please note that: - when running pg_stat_have_stats('relation',...) we now need to be connected to the database that hosts the relation. As pg_stat_have_stats() is not documented publicly, the changes done in 029_stats_restart.pl look sufficient. - this patch does not handle rewrites so some tests are failing. Its only intent is to ease the review, and it should not be pushed without being merged with the following patch that handles the rewrites. - it can be used to test that stats are incremented correctly and that we're able to retrieve them as long as rewrites are not involved. ==== 0003 handle relation statistics correctly during rewrites Now that PGSTAT_KIND_RELATION is keyed by relfilenode, we need to handle rewrites. To do so, this patch: - Adds PgStat_PendingRewrite, a new struct to track rewrite operations within a transaction, storing the old locator, new locator, and original locator (for rewrite chains). This allows stats to be copied from the original location to the final location at commit time. - Adds a new function, pgstat_mark_rewrite(), called when a table rewrite begins. It records the rewrite operation in a local list and detects rewrite chains by checking if the old_locator matches any existing new_locator, preserving the chain's original_locator. - Modifies pgstat_copy_relation_stats() to accept RelFileLocators instead of Relations, with a new increment parameter to accumulate stats (needed for rewrite chains with DML between rewrites). - Ensures that AtEOXact_PgStat_Relations(), AtPrepare_PgStat_Relations(), pgstat_twophase_postcommit()/postabort() and pgstat_drop_relation() handle the PgStat_PendingRewrite list correctly. 
Note that due to the new flush call in pgstat_twophase_postcommit() we cannot call GetCurrentTransactionStopTimestamp() in pgstat_relation_flush_cb(). So, adding a check to handle this special case and call GetCurrentTimestamp() instead. Note that we'd call GetCurrentTimestamp() only if there is a rewrite, so that the GetCurrentTimestamp() extra cost should be negligible. Another solution could be to trigger the flush from FinishPreparedTransaction() but that's not worth the extra complexity. The new pending_rewrites list is traversed in multiple places. The overhead should be negligible in comparison to a rewrite and the list should not contain a lot of rewrites in practice. Another design that I tried was to copy the stats in pgstat_mark_rewrite() but that led to difficulties with aborts and subtransactions. It looks to me that the list approach proposed here makes more sense. We could also imagine adding a function similar to pg_stat_have_stats() that would take relfile locator as arguments. That could help validate that after a rewrite the old stats are gone. > Do you think it is OK to define non-transactional pending stats as > being always a subset of the transactional stats? I don't quite see > if there would be a case to have stats that are only flushed in a > non-transactional path, while being discarded at the stats report done > at transaction commit time. This means that it may be possible to > structure things so as the pending non-transaction stats structure are > always part of the transactional bits, and that the other way around > is not possible. Perhaps that influences the design choices, at least > a bit. The proposed patch does not change anything in that regard. It keeps the relation's behavior as it is. This patch just ensures that a relation rewrite keeps its stats. Adding new stats (add writes counters) and ensuring that some counters (n_dead_tup and friends) are replicated will be done once this one gets in. 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com ^ permalink raw reply [nested|flat] 24+ messages in thread
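Editor's sketch of the rewrite-chain detection described for 0003: each recorded rewrite keeps the old, new and original locators, and a rewrite whose old locator matches an earlier entry's new locator inherits that entry's original locator. Everything below (type names, the fixed-size array, the function names) is hypothetical and is not the patch's code; it only illustrates the chain rule as stated in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for RelFileLocator. */
typedef struct Locator
{
    uint32_t spc_oid;
    uint32_t db_oid;
    uint32_t rel_number;
} Locator;

/* Hypothetical stand-in for the pending-rewrite entry described above. */
typedef struct PendingRewrite
{
    Locator old_locator;
    Locator new_locator;
    Locator original_locator;   /* start of the rewrite chain */
} PendingRewrite;

#define MAX_PENDING_REWRITES 16 /* arbitrary bound for this sketch */

typedef struct PendingRewriteList
{
    PendingRewrite items[MAX_PENDING_REWRITES];
    size_t      count;
} PendingRewriteList;

static inline bool
locator_equals(Locator a, Locator b)
{
    return a.spc_oid == b.spc_oid &&
           a.db_oid == b.db_oid &&
           a.rel_number == b.rel_number;
}

/*
 * Record a rewrite from old_loc to new_loc. If old_loc is itself the
 * result of an earlier rewrite in the list, keep that chain's original
 * locator. Returns false when the sketch's fixed-size list is full.
 */
static inline bool
mark_rewrite(PendingRewriteList *list, Locator old_loc, Locator new_loc)
{
    Locator original = old_loc;

    assert(list != NULL);
    if (list->count >= MAX_PENDING_REWRITES)
        return false;

    for (size_t i = 0; i < list->count; i++)
    {
        if (locator_equals(list->items[i].new_locator, old_loc))
        {
            original = list->items[i].original_locator;
            break;
        }
    }

    list->items[list->count].old_locator = old_loc;
    list->items[list->count].new_locator = new_loc;
    list->items[list->count].original_locator = original;
    list->count++;
    return true;
}
```

With two rewrites A to B and then B to C in one transaction, the second entry's original locator is A, which is what lets a commit-time step copy stats from the first location to the last one.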
* Re: relfilenode statistics @ 2025-11-08 23:33 Michael Paquier <[email protected]> parent: Bertrand Drouvot <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Michael Paquier @ 2025-11-08 23:33 UTC (permalink / raw) To: Bertrand Drouvot <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] On Fri, Nov 07, 2025 at 11:28:27AM +0000, Bertrand Drouvot wrote: > While there are existing rewrite tests, the stats behavior during rewrites > doesn't have a good coverage. This patch adds some tests to record some stats > after different rewrite scenarios. > > That way, we'll be able to test that the stats are still the ones we > expect after rewrites. Note that it generates a new stats_1.out (which is quite > large), so we may want to move those new tests to "isolation" instead. Looking at this part of the patch set for now, not looked at the rest yet. This new stats_1.out is 2k lines long, introduced for the tests related to rewrites as an effect of 2PC. It seems to me that a split into a new stats_rewrite would be justified for this case, to reduce the output duplication. -- Michael Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-11-10 08:53 Michael Paquier <[email protected]> parent: Michael Paquier <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Michael Paquier @ 2025-11-10 08:53 UTC (permalink / raw) To: Bertrand Drouvot <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] On Sun, Nov 09, 2025 at 08:33:54AM +0900, Michael Paquier wrote: > Looking at this part of the patch set for now, not looked at the rest > yet. This new stats_1.out is 2k lines long, introduced for the tests > related to rewrites as an effect of 2PC. It seems to me that a split > into a new stats_rewrite would be justified for this case, to reduce > the output duplication. The first patch had an issue with some of the tests checking for dead tuples: if an autovacuum kicks in before querying the stats, we would get a dead tuple number of 0. So I have expanded the tests a bit to avoid autovacuum interactions, which should be enough to avoid noise, did a split into a new file, which should also be fine because we don't rely on a system-wide stats reset, then applied the result. The patch is spending a great deal of effort on three fronts: - making sure that the statistics are copied over after a relation rewrite. - making sure that we assign a "correct" object ID, assigning the fields of RelFileLocator based on a relation ID. Mapped and shared relations make the exercise a bit more difficult. It would be nice to avoid this kind of duplication with other code paths that assign a RelFileLocator. - Partitioned tables, where we don't have a relfilenode but we need to track statistics. The patch relies on the relation oid to assign a key, as far as I've read. Among the three points, the first one is the most invasive in the patch, it seems, and do we actually want to keep the stats across rewrites at all? 
The main reason of doing the relfilenode move would be to rebuild these stats on a WAL-record basis because the relfile locator is the only thing we know in the startup process, and once rewritten the state of the data is different. relation_needs_vacanalyze() then cares about three fields: - Number of dead tuples, which would be 0 after a rewrite. - ins_since_vacuum, which would be 0 after a rewrite. - mod_since_analyze, for analyze, again 0. I have not checked the recent autovacuum scheduling thread to see if this set changes there. Are these numbers worth the effort of copying over at the end? Was this particular point discussed? I've seen this mentioned once here, but I am wondering what are the arguments in favor of copying the stats data versus not copying it across rewrites: https://www.postgresql.org/message-id/20240607031736.7izmr2yirznvidka%40awork3.anarazel.de -- Michael Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download ^ permalink raw reply [nested|flat] 24+ messages in thread
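Editor's sketch of why the three counters above matter: autovacuum compares each of them with a threshold of the form base + scale_factor * reltuples, as described in the PostgreSQL documentation for the autovacuum settings. The struct and function names are hypothetical, and the logic is deliberately simplified (it ignores wraparound-forced vacuums, disabled insert thresholds and newer refinements), so it is not the body of relation_needs_vacanalyze().

```c
#include <assert.h>
#include <stdbool.h>

/* The three counters named above; the struct itself is hypothetical. */
typedef struct RelCounters
{
    double dead_tuples;
    double ins_since_vacuum;
    double mod_since_analyze;
} RelCounters;

/* Hypothetical bundle of one base threshold and its scale factor. */
typedef struct Threshold
{
    double base;
    double scale_factor;
} Threshold;

/* threshold = base + scale_factor * reltuples */
static inline double
threshold_for(Threshold t, double reltuples)
{
    assert(reltuples >= 0);
    return t.base + t.scale_factor * reltuples;
}

/* Simplified: vacuum if dead tuples or inserts exceed their thresholds. */
static inline bool
needs_vacuum(RelCounters c, double reltuples, Threshold vac, Threshold ins)
{
    return c.dead_tuples > threshold_for(vac, reltuples) ||
           c.ins_since_vacuum > threshold_for(ins, reltuples);
}

/* Simplified: analyze if modifications exceed the analyze threshold. */
static inline bool
needs_analyze(RelCounters c, double reltuples, Threshold anl)
{
    return c.mod_since_analyze > threshold_for(anl, reltuples);
}
```

If all three counters are zero, neither check can fire regardless of the settings, so whether the counters are carried over or reset at a rewrite directly changes the next scheduling decision for that relation.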
* Re: relfilenode statistics @ 2025-11-12 17:03 Bertrand Drouvot <[email protected]> parent: Michael Paquier <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Bertrand Drouvot @ 2025-11-12 17:03 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Mon, Nov 10, 2025 at 05:53:45PM +0900, Michael Paquier wrote: > On Sun, Nov 09, 2025 at 08:33:54AM +0900, Michael Paquier wrote: > > Looking at this part of the patch set for now, not looked at the rest > > yet. This new stats_1.out is 2k lines long, introduced for the tests > > related to rewrites as an effect of 2PC. It seems to me that a split > > into a new stats_rewrite would be justified for this case, to reduce > > the output duplication. > > did a split into a new file, which should also be fine because we > don't rely on a system-wide stats reset, then applied the result. Thanks! > The patch is spending a great deal of effort on three fronts: > - making sure that the statistics are copied over after a relation > rewrite. Right, in 0003. > - making sure that we assign a "correct" object ID, assigning > the fields of RelFileLocator based on a relation ID. Mapped and > shared relations make the exercise a bit more difficult. It would be > nice to avoid this kind of duplication with other code paths that > assign a RelFileLocator. Are you referring to the new pgstat_reloid_to_relfilelocator() function? If so, I'll try to avoid code duplication with other code paths as suggested. > - Partitioned tables, where we don't have a relfilenode but we need to > track statistics. The patch relies on the relation oid to assign a > key, as far as I've read. Right. It's not doing that much in this area. It's needed so that things like "last_analyze" on a partitioned table is populated (see "Ensure only the partitioned table is analyzed" in vacuum.sql). 
> Among the three points, the first one is the most invasive in the > patch, it seems, and do we actually want to keep the stats across > rewrites at all? Not doing so would mean that all stats related to a relation will be lost after a rewrite. I think that would be a major regression as compared to the current behavior. > The main reason of doing the relfilenode move > would be to rebuild these stats on a WAL-record basis because the > relfile locator is the only thing we know in the startup process, and > once rewritten the state of the data is different. > relation_needs_vacanalyze() then cares about three fields: > - Number of dead tuples, which would be 0 after a rewrite. > - ins_since_vacuum, which would be 0 after a rewrite. > - mod_since_analyze, for analyze, again 0. > > I have not checked the recent autovacuum scheduling thread to see if > this set changes there. > > Are these numbers worth the effort of copying over at the end? I think so because that would impact all the other relation's stats (not only the ones linked to relation_needs_vacanalyze()). > Was > this particular point discussed? I've seen this mentioned once here, > but I am wondering what are the arguments in favor of copying the > stats data versus not copying it across rewrites: > https://www.postgresql.org/message-id/20240607031736.7izmr2yirznvidka%40awork3.anarazel.de In favor of copying, I would say: - no regression as compared to the current behavior. That means, for example, not breaking DBA's activities/decisions based on the pg_stat_all_tables fields after a rewrite. - a rewrite is not changing the number of dead tuples, ins_since_vacuum and mod_since_analyze. So, if we don't copy those, then we'd change the relation_needs_vacanalyze() decision(s) as compared to the current one(s) for no reason (as a rewrite has no impact on those). In favor of not copying, I would say: it makes the code simpler. I'm in favor of copying but open to different points of view. 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-12-15 16:29 Bertrand Drouvot <[email protected]> parent: Bertrand Drouvot <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Bertrand Drouvot @ 2025-12-15 16:29 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Wed, Nov 12, 2025 at 05:03:55PM +0000, Bertrand Drouvot wrote: > In favor of not copying, I would say make the code simpler. > > I'm in favor of copying but open to different point of views. PFA a mandatory rebase. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-12-15 17:48 Andres Freund <[email protected]> parent: Bertrand Drouvot <[email protected]> 0 siblings, 2 replies; 24+ messages in thread From: Andres Freund @ 2025-12-15 17:48 UTC (permalink / raw) To: Bertrand Drouvot <[email protected]>; +Cc: Michael Paquier <[email protected]>; Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On 2025-12-15 16:29:18 +0000, Bertrand Drouvot wrote: > From 7908ba56cb8b6255b869af6be13077aa0315d5f1 Mon Sep 17 00:00:00 2001 > From: Bertrand Drouvot <[email protected]> > Date: Wed, 1 Oct 2025 09:45:26 +0000 > Subject: [PATCH v8 1/2] Key PGSTAT_KIND_RELATION by relfile locator > > This patch changes the key used for the PGSTAT_KIND_RELATION statistic kind. > Instead of the relation oid, it now relies on: > > - dboid (linked to RelFileLocator's dbOid) > - objoid which is the result of a new macro (namely RelFileLocatorToPgStatObjid()) > that computes an objoid based on the RelFileLocator's spcOid and the > RelFileLocator's relNumber. I think this needs to make more explicit that this works because the object ID now is a uint64, and that both the inputs are 32 bits. > That will allow us to add new stats (add writes counters) and ensure that some > counters (n_dead_tup and friends) are replicated. Yay. > The patch introduces pgstat_reloid_to_relfilelocator() to 1) avoid calling > RelationIdGetRelation() to get the relfilelocator based on the relation oid > and 2) handle the partitioned table case. > > Please note that: > > - when running pg_stat_have_stats('relation',...) we now need to be connected > to the database that hosts the relation. As pg_stat_have_stats() is not > documented publicly, then the changes done in 029_stats_restart.pl look > enough. That seems fine. > - this patch does not handle rewrites so some tests are failing. 
It's only > intent is to ease the review and should not be pushed without being > merged with the following patch that handles the rewrites. Makes sense. > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c > index 62035b7f9c3..a9b2b4e1033 100644 > --- a/src/backend/access/heap/vacuumlazy.c > +++ b/src/backend/access/heap/vacuumlazy.c > @@ -961,8 +961,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params, > * soon in cases where the failsafe prevented significant amounts of heap > * vacuuming. > */ > - pgstat_report_vacuum(RelationGetRelid(rel), > - rel->rd_rel->relisshared, > + pgstat_report_vacuum(rel->rd_locator, > Max(vacrel->new_live_tuples, 0), > vacrel->recently_dead_tuples + > vacrel->missed_dead_tuples, Why not pass in the Relation itself? Given that we do that already for pgstat_report_analyze(), it seems like that'd be an improvement even independent of this change? > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c > index 1bd3924e35e..563a3697690 100644 > --- a/src/backend/postmaster/autovacuum.c > +++ b/src/backend/postmaster/autovacuum.c > @@ -2048,8 +2048,7 @@ do_autovacuum(void) > > /* Fetch reloptions and the pgstat entry for this table */ > relopts = extract_autovac_opts(tuple, pg_class_desc); > - tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared, > - relid); > + tabentry = pgstat_fetch_stat_tabentry_ext(relid); > > /* Check if it needs vacuum or analyze */ > relation_needs_vacanalyze(relid, relopts, classForm, tabentry, I don't think this is good - now do_autovacuum() will do a separate syscache lookup to fetch information the caller already has (due to the pgstat_reloid_to_relfilelocator() in pgstat_fetch_stat_tabentry_ext()). That's not too bad for things like viewing stats, but do_autovacuum() does this for every table in a database... 
> @@ -326,9 +363,26 @@ pgstat_report_analyze(Relation rel, > ts = GetCurrentTimestamp(); > elapsedtime = TimestampDifferenceMilliseconds(starttime, ts); > > + if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) > + locator = rel->rd_locator; > + else > + { > + /* > + * Partitioned tables don't have storage, so construct a synthetic > + * locator for statistics tracking. Use the relation OID as relNumber. > + * No collision with regular relations is possible because relNumbers > + * are also assigned from the pg_class OID space (see > + * GetNewRelFileNumber()), making each value unique across the > + * database regardless of spcOid. > + */ I don't think this is true as stated. Two reasons: 1) This afaict guarantees that the relfilenode will not clash with oids, but it does *NOT* guarantee that it does not clash with other relfilenodes 2) Note that GetNewRelFileNumber() does *NOT* check for conflicts when creating a new relfilenode for an existing relation: * If the relfilenumber will also be used as the relation's OID, pass the * opened pg_class catalog, and this routine will guarantee that the result * is also an unused OID within pg_class. If the result is to be used only * as a relfilenumber for an existing relation, pass NULL for pg_class. Greetings, Andres Freund ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-12-16 07:33 Michael Paquier <[email protected]> parent: Andres Freund <[email protected]> 1 sibling, 2 replies; 24+ messages in thread From: Michael Paquier @ 2025-12-16 07:33 UTC (permalink / raw) To: Andres Freund <[email protected]>; +Cc: Bertrand Drouvot <[email protected]>; Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] On Mon, Dec 15, 2025 at 12:48:25PM -0500, Andres Freund wrote: > I don't think this is true as stated. Two reasons: > > 1) This afaict guarantees that the relfilenode will not clash with oids, but > it does *NOT* guarantee that it does not clash with other relfilenodes > > 2) Note that GetNewRelFileNumber() does *NOT* check for conflicts when > creating a new relfilenode for an existing relation: > * If the relfilenumber will also be used as the relation's OID, pass the > * opened pg_class catalog, and this routine will guarantee that the result > * is also an unused OID within pg_class. If the result is to be used only > * as a relfilenumber for an existing relation, pass NULL for pg_class. FWIW, I am also still troubled by the part of the proposed patch set where we are trying to hide the idea of a partitioned table has a relfilenode set by using its relid instead in the key for the data. This leads to a huge amount of complexity in the patch, mainly to store data for autovacuum that we do not need at the end: - autovacuum discards partitioned tables in do_autovacuum(), so the stats related to partitioned tables that we need to select the relations does not matter. - manual vacuums may include partitioned tables to extract its partitions, vacuum_rel() at the end discarding them. Well, stats don't matter anyway. We only need to attach three fields to let autovacuum know if a relation needs to run or not: dead_tuples, ins_since_vacuum, mod_since_analyze. 
Most of the fields of PgStat_StatTabEntry make sense only for tables,
few are required by indexes for pg_stat_all_indexes. Some fields
actually make sense because they refer to on-disk files, mostly for
pg_statio_all_tables (blocks_fetched, blocks_hit).

Hence, why don't we split PgStat_StatTabEntry into three things from
the start, even if it means to duplicate some of them? Say:
- Table fields: includes [auto]vacuum/analyze data, block fields,
  fields of pg_stat_all_tables.
- Index fields: no need for the [auto]vacuum/analyze time and counts,
  block fields, pg_stat_all_indexes fields.
- Relfilenode fields: dead_tuples, ins_since_vacuum and
  mod_since_analyze. Does not apply to partitioned tables and indexes,
  only applies to tables. Provides a clean split, embrace the fact that
  these are the only three fields we need to worry about during
  recovery.
--
Michael

Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download

^ permalink raw reply [nested|flat] 24+ messages in thread
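The three-way split proposed in the message above could look roughly like the following. This is only a sketch of the proposal as described, not code from the patch and not an agreed design: every type name, the table and index field lists, and the threshold parameters are hypothetical; only dead_tuples, ins_since_vacuum and mod_since_analyze come from the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PgStat_Counter; hypothetical name. */
typedef int64_t SketchCounter;

/* Hypothetical: the three counters proposed to be keyed by relfilenode. */
typedef struct SketchRelfilenodeStats
{
    SketchCounter dead_tuples;
    SketchCounter ins_since_vacuum;
    SketchCounter mod_since_analyze;
} SketchRelfilenodeStats;

/* Hypothetical table-level entry: vacuum/analyze data and block fields. */
typedef struct SketchTableStats
{
    SketchCounter vacuum_count;
    SketchCounter analyze_count;
    SketchCounter blocks_fetched;
    SketchCounter blocks_hit;
} SketchTableStats;

/* Hypothetical index-level entry: no vacuum/analyze fields. */
typedef struct SketchIndexStats
{
    SketchCounter numscans;
    SketchCounter blocks_fetched;
    SketchCounter blocks_hit;
} SketchIndexStats;

/*
 * Simplified decision helpers.  The thresholds are passed in by the caller;
 * how they are computed is outside the scope of this sketch.
 */
static inline bool
sketch_needs_vacuum(const SketchRelfilenodeStats *s,
                    SketchCounter dead_threshold,
                    SketchCounter insert_threshold)
{
    return s->dead_tuples > dead_threshold ||
           s->ins_since_vacuum > insert_threshold;
}

static inline bool
sketch_needs_analyze(const SketchRelfilenodeStats *s,
                     SketchCounter mod_threshold)
{
    return s->mod_since_analyze > mod_threshold;
}
```

With this shape, only SketchRelfilenodeStats would have to be rebuilt during recovery, which is the point the proposal makes; the table and index entries would stay keyed by relation OID.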
* Re: relfilenode statistics @ 2025-12-16 10:22 Bertrand Drouvot <[email protected]> parent: Andres Freund <[email protected]> 1 sibling, 1 reply; 24+ messages in thread From: Bertrand Drouvot @ 2025-12-16 10:22 UTC (permalink / raw) To: Andres Freund <[email protected]>; +Cc: Michael Paquier <[email protected]>; Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Mon, Dec 15, 2025 at 12:48:25PM -0500, Andres Freund wrote: > On 2025-12-15 16:29:18 +0000, Bertrand Drouvot wrote: > > From 7908ba56cb8b6255b869af6be13077aa0315d5f1 Mon Sep 17 00:00:00 2001 > > I think this needs to make more explicit that this works because the object ID > now is a uint64, and that both the inputs are 32 bits. Yeah, it's now added in the commit message (mentioning b14e9ce7d55c). > > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c > > index 62035b7f9c3..a9b2b4e1033 100644 > > --- a/src/backend/access/heap/vacuumlazy.c > > +++ b/src/backend/access/heap/vacuumlazy.c > > @@ -961,8 +961,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params, > > * soon in cases where the failsafe prevented significant amounts of heap > > * vacuuming. > > */ > > - pgstat_report_vacuum(RelationGetRelid(rel), > > - rel->rd_rel->relisshared, > > + pgstat_report_vacuum(rel->rd_locator, > > Max(vacrel->new_live_tuples, 0), > > vacrel->recently_dead_tuples + > > vacrel->missed_dead_tuples, > > Why not pass in the Relation itself? Given that we do that already for > pgstat_report_analyze(), it seems like that'd be an improvement even > independent of this change? Makes sense, done in [1]. 
> > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c > > index 1bd3924e35e..563a3697690 100644 > > --- a/src/backend/postmaster/autovacuum.c > > +++ b/src/backend/postmaster/autovacuum.c > > @@ -2048,8 +2048,7 @@ do_autovacuum(void) > > > > /* Fetch reloptions and the pgstat entry for this table */ > > relopts = extract_autovac_opts(tuple, pg_class_desc); > > - tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared, > > - relid); > > + tabentry = pgstat_fetch_stat_tabentry_ext(relid); > > > > /* Check if it needs vacuum or analyze */ > > relation_needs_vacanalyze(relid, relopts, classForm, tabentry, > > I don't think this is good - now do_autovacuum() will do a separate syscache > lookup to fetch information the caller already has (due to the > pgstat_reloid_to_relfilelocator() in pgstat_fetch_stat_tabentry_ext()). That's > not too bad for things like viewing stats, but do_autovacuum() does this for > every table in a database... Good point. In the attached I added pgstat_fetch_stat_tabentry_by_locator(). It's called directly in do_autovacuum() and also in pgstat_fetch_stat_tabentry_ext(). I did not check if there are other places where we can call pgstat_fetch_stat_tabentry_by_locator() directly. I want first to validate this idea makes sense, does it? > I don't think this is true as stated. Two reasons: > > 1) This afaict guarantees that the relfilenode will not class with oids, but > it does *NOT* guarantee that it does not clash with other relfilenodes > 2) Note that GetNewRelFileNumber() does *NOT* check for conflicts when > creating a new relfilenode for an existing relation: > * If the relfilenumber will also be used as the relation's OID, pass the > * opened pg_class catalog, and this routine will guarantee that the result > * is also an unused OID within pg_class. If the result is to be used only > * as a relfilenumber for an existing relation, pass NULL for pg_class. Oh right, in case of OID wraparound. 
In the attached I added a new

"#define PSEUDO_PARTITION_TABLE_SPCOID 1665"

to ensure uniqueness then.

[1]: https://www.postgresql.org/message-id/flat/aUEA6UZZkDCQFgSA%40ip-10-97-1-34.eu-west-3.compute.intern...

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply [nested|flat] 24+ messages in thread
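To illustrate why a reserved tablespace OID sidesteps the collision raised above, here is a minimal sketch. The bit layout (spcOid in the high 32 bits, relNumber in the low 32 bits) is an assumption made for illustration, and all Sketch* names are hypothetical stand-ins rather than the patch's types. 1663 is pg_default's tablespace OID; 1665 is the value from the message above.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's Oid and RelFileNumber (both 32 bits wide). */
typedef uint32_t Oid;
typedef Oid RelFileNumber;

/* Value quoted from the message above. */
#define PSEUDO_PARTITION_TABLE_SPCOID 1665

/* Simplified stand-in for RelFileLocator; not the real struct. */
typedef struct SketchLocator
{
    Oid           spcOid;     /* tablespace */
    Oid           dbOid;      /* database, 0 for shared relations */
    RelFileNumber relNumber;  /* relation file number */
} SketchLocator;

/*
 * ASSUMPTION: spcOid goes into the high 32 bits and relNumber into the low
 * 32 bits of the 64-bit stats object id.
 */
static inline uint64_t
sketch_locator_to_objid(SketchLocator loc)
{
    return ((uint64_t) loc.spcOid << 32) | (uint64_t) loc.relNumber;
}

/* Build the synthetic locator for a partitioned table (no storage). */
static inline SketchLocator
sketch_partitioned_locator(Oid dboid, Oid reloid)
{
    SketchLocator loc = {PSEUDO_PARTITION_TABLE_SPCOID, dboid, reloid};

    return loc;
}

/* Walk through the collision case within one database. */
static inline void
sketch_collision_demo(void)
{
    /* A regular table in pg_default (1663) whose relfilenumber is 16400. */
    SketchLocator regular = {1663, 5, 16400};
    /* A partitioned table with OID 16400, naively reusing a real tablespace. */
    SketchLocator naive = {1663, 5, 16400};
    /* The same partitioned table, using the reserved pseudo tablespace. */
    SketchLocator pseudo = sketch_partitioned_locator(5, 16400);

    /* Same tablespace and same number: the two stats keys collide. */
    assert(sketch_locator_to_objid(regular) == sketch_locator_to_objid(naive));
    /* A tablespace OID no real relation uses keeps the keys apart. */
    assert(sketch_locator_to_objid(regular) != sketch_locator_to_objid(pseudo));
}
```

Because no real relation can live in the pseudo tablespace, the high half of the key differs even when a relation OID and an unrelated relfilenumber happen to be equal, for example after OID wraparound.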
* Re: relfilenode statistics @ 2025-12-16 10:24 Bertrand Drouvot <[email protected]> parent: Michael Paquier <[email protected]> 1 sibling, 0 replies; 24+ messages in thread From: Bertrand Drouvot @ 2025-12-16 10:24 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Andres Freund <[email protected]>; Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Tue, Dec 16, 2025 at 04:33:17PM +0900, Michael Paquier wrote: > > Hence, why don't we split PgStat_StatTabEntry into three things from > the start, even if it means to duplicate some of them? Say: > - Table fields: includes [auto]vacuum/analyze data, block fields, > fields of pg_stat_all_tables. > - Index fields: no need for the [auto]vacuum/analyze time and counts, > block fields, pg_stat_all_indexes fields. > - Relfilenode fields: dead_tuples, ins_since_vacuum and > mod_since_analyze. Does not apply to partitioned tables and indexes, > only applies to tables. Provides a clean split, embrace the fact that > these are the only three fields we need to worry about during > recovery. I think that the PSEUDO_PARTITION_TABLE_SPCOID just proposed in [1] approach is simple enough and solves the collision issue raised by Andres. I think I prefer the unified structure as proposed in the patch (though we may want to split tables and indexes later on). The reason is that it's easier to expose publicly. Indeed, at the very beginning of this thread, in v1, I created a new PGSTAT_KIND_RELFILENODE and had to make it coexist with PGSTAT_KIND_RELATION and that led to discussion on how we should expose them ([2]). [1]: https://www.postgresql.org/message-id/aUEyzoOJtrCLAEeT%40ip-10-97-1-34.eu-west-3.compute.internal [2]: https://www.postgresql.org/message-id/CA%2BTgmoZtwT6h%3DnyuQ1J9GNSrRyhf0fv7Ai6FzO%3DbH0C9Bf6tew%40ma... 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com ^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2025-12-16 15:39 Andres Freund <[email protected]> parent: Michael Paquier <[email protected]> 1 sibling, 1 reply; 24+ messages in thread From: Andres Freund @ 2025-12-16 15:39 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Bertrand Drouvot <[email protected]>; Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On 2025-12-16 16:33:17 +0900, Michael Paquier wrote: > On Mon, Dec 15, 2025 at 12:48:25PM -0500, Andres Freund wrote: > > I don't think this is true as stated. Two reasons: > > > > 1) This afaict guarantees that the relfilenode will not clash with oids, but > > it does *NOT* guarantee that it does not clash with other relfilenodes > > > > 2) Note that GetNewRelFileNumber() does *NOT* check for conflicts when > > creating a new relfilenode for an existing relation: > > * If the relfilenumber will also be used as the relation's OID, pass the > > * opened pg_class catalog, and this routine will guarantee that the result > > * is also an unused OID within pg_class. If the result is to be used only > > * as a relfilenumber for an existing relation, pass NULL for pg_class. > > FWIW, I am also still troubled by the part of the proposed patch set > where we are trying to hide the idea of a partitioned table has a > relfilenode set by using its relid instead in the key for the data. > This leads to a huge amount of complexity in the patch, mainly to > store data for autovacuum that we do not need at the end: > - autovacuum discards partitioned tables in do_autovacuum(), so the > stats related to partitioned tables that we need to select the > relations does not matter. I feel like that's an implementation wart that we ought to fix. It's not infrequently a problem that we don't automatically analyze partitioned tables. Weren't there even a couple threads on that on the list in the last weeks? 
> - manual vacuums may include partitioned tables to extract its > partitions, vacuum_rel() at the end discarding them. Well, stats > don't matter anyway. > > We only need to attach three fields to let autovacuum know if a > relation needs to run or not: dead_tuples, ins_since_vacuum, > mod_since_analyze. That may be true for autovacuum today, but I don't see any reason for live_tuples, tuples_inserted etc to be inaccurate after a failover. > Most the fields of PgStat_StatTabEntry make sense > only for tables, few are required by indexes for pg_stat_all_indexes. > Some fields actually make sense because they refer to on-disk files, > mostly for pg_statio_all_tables (blocks_fetched, blocks_hit). > > Hence, why don't we split PgStat_StatTabEntry into three things from > the start, even if it means to duplicate some of them? Say: > - Table fields: includes [auto]vacuum/analyze data, block fields, > fields of pg_stat_all_tables. What do you mean with "block fields"? pg_statio_all_tables? If so, what's the point of including them here, rather than in the relfilenode fields? > - Index fields: no need for the [auto]vacuum/analyze time and counts, > block fields, pg_stat_all_indexes fields. I think we actually should populate the [auto]vac fields for indexes, right now it's impossible to figure out from stats whether indexes are frequently scanned as part of vacuum or not. > - Relfilenode fields: dead_tuples, ins_since_vacuum and > mod_since_analyze. Does not apply to partitioned tables and indexes, > only applies to tables. Provides a clean split, embrace the fact that > these are the only three fields we need to worry about during > recovery. I think we really ought to populate not just these during recovery, but also at least n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd, n_live_tup. I don't understand why we would want to only populate these three fields? I'm not against splitting the index fields off, but it seems pretty orthogonal to what we're discussing here. 
If we were to split off index stats into a
separate stat, why wouldn't we keep the statio fields in the relfilenode
stats, since they're obviously intimately tied to that?

Greetings,

Andres Freund

^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics
@ 2025-12-17 07:30 Bertrand Drouvot <[email protected]>
  parent: Bertrand Drouvot <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread
From: Bertrand Drouvot @ 2025-12-17 07:30 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: Michael Paquier <[email protected]>; Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected]

Hi,

On Tue, Dec 16, 2025 at 10:22:06AM +0000, Bertrand Drouvot wrote:
> In the attached

PFA a mandatory rebase due to f4e797171ea.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2026-01-13 09:29 Bertrand Drouvot <[email protected]> parent: Andres Freund <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Bertrand Drouvot @ 2026-01-13 09:29 UTC (permalink / raw) To: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Tue, Dec 16, 2025 at 10:39:15AM -0500, Andres Freund wrote: > On 2025-12-16 16:33:17 +0900, Michael Paquier wrote: > > > > FWIW, I am also still troubled by the part of the proposed patch set > > where we are trying to hide the idea of a partitioned table has a > > relfilenode set by using its relid instead in the key for the data. > > This leads to a huge amount of complexity in the patch, mainly to > > store data for autovacuum that we do not need at the end: > > - autovacuum discards partitioned tables in do_autovacuum(), so the > > stats related to partitioned tables that we need to select the > > relations does not matter. > > I feel like that's an implementation wart that we ought to fix. It's not > infrequently a problem that we don't automatically analyze partitioned > tables. Weren't there even a couple threads on that on the list in the last > weeks? > > > > - manual vacuums may include partitioned tables to extract its > > partitions, vacuum_rel() at the end discarding them. Well, stats > > don't matter anyway. > > > > We only need to attach three fields to let autovacuum know if a > > relation needs to run or not: dead_tuples, ins_since_vacuum, > > mod_since_analyze. > > That may be true for autovacuum today, but I don't see any reason for > live_tuples, tuples_inserted etc to be inaccurate after a failover. > > > Most the fields of PgStat_StatTabEntry make sense > > only for tables, few are required by indexes for pg_stat_all_indexes. 
> > Some fields actually make sense because they refer to on-disk files, > > mostly for pg_statio_all_tables (blocks_fetched, blocks_hit). > > > > Hence, why don't we split PgStat_StatTabEntry into three things from > > the start, even if it means to duplicate some of them? Say: > > - Table fields: includes [auto]vacuum/analyze data, block fields, > > fields of pg_stat_all_tables. > > What do you mean with "block fields"? pg_statio_all_tables? If so, what's the > point of including them here, rather than in the relfilenode fields? > > > > - Index fields: no need for the [auto]vacuum/analyze time and counts, > > block fields, pg_stat_all_indexes fields. > > I think we actually should populate the [auto]vac fields for indexes, right > now it's impossible to figure out from stats whether indexes are frequently > scanned as part of vacuum or not. > > > > - Relfilenode fields: dead_tuples, ins_since_vacuum and > > mod_since_analyze. Does not apply to partitioned tables and indexes, > > only applies to tables. Provides a clean split, embrace the fact that > > these are the only three fields we need to worry about during > > recovery. > > I think we really ought to populate not just these during recovery, but also > at least n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd, n_live_tup. > > I don't understand why we would want to only populate these three fields? > > > I'm not against splitting the index fields off, but it seems pretty orthogonal > to what we're discussing here. If we were to split of index stats into a > separate stat, why wouldn't we keep the statio fields in the relfilenode > stats, since they're obviously intimately tied to that? Andres, Michael, let me try to sum up my understanding of the current state and see how we could now move forward. First of all, I understand that you both think that the patch outcome will be useful to have. 
The current debate is about the design; the current status is:

- Andres raised specific technical/implementation concerns and I've proposed
  solutions in [1]. It also looks like Andres supports the overall design approach.
- Michael is not really ok with the current design approach.

That means that, with the current design in place, Michael would probably not
commit it (even after review(s)).

Given that I'm also in favor of the current proposed design, this raises the
questions:

- Andres, would you commit such a patch (after review iteration(s) of course)?
- Michael, if Andres is ok with the above, would you still offer your help for the
  review part (even if the design is not what you "prefer"/"like")?

[1]: https://postgr.es/m/aUEyzoOJtrCLAEeT%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

^ permalink raw reply [nested|flat] 24+ messages in thread
* Re: relfilenode statistics @ 2026-02-23 06:39 Bertrand Drouvot <[email protected]> parent: Bertrand Drouvot <[email protected]> 0 siblings, 1 reply; 24+ messages in thread From: Bertrand Drouvot @ 2026-02-23 06:39 UTC (permalink / raw) To: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected] Hi, On Tue, Jan 13, 2026 at 09:29:02AM +0000, Bertrand Drouvot wrote: > Hi, > > On Tue, Dec 16, 2025 at 10:39:15AM -0500, Andres Freund wrote: > > On 2025-12-16 16:33:17 +0900, Michael Paquier wrote: > > Andres, Michael, let me try to sum up my understanding of the current state > and see how we could now move forward. > > First of all, I understand that you both think that the patch outcome will be > useful to have. The current debate is about the design, the current status is: > > - Andres raised specific technical/implementation concerns and I've proposed > solutions in [1]. It also looks like Andres supports the overall design approach. > - Michael is not really ok with the current design approach. > > That means, that with the current design in place, Michael would probably not > commit it (even after review(s)). > > Given that I'm also in favor of the current proposed design, this raises the > questions: > > - Andres, would you commit such a patch (after review iteration(s) of course)? > - Michael, if Andres is ok with the above, would you still offer your help for the > review part (even if the design is not what you "prefer"/"like")? > > [1]: https://postgr.es/m/aUEyzoOJtrCLAEeT%40ip-10-97-1-34.eu-west-3.compute.internal PFA, tiny rebase due to 9842e8aca09. 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com Attachments: [text/x-diff] v11-0001-Key-PGSTAT_KIND_RELATION-by-relfile-locator.patch (27.6K, 2-v11-0001-Key-PGSTAT_KIND_RELATION-by-relfile-locator.patch) download | inline diff: From 978590dadd55d83fbbc7b81c0588fea6df4fdd60 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Wed, 1 Oct 2025 09:45:26 +0000 Subject: [PATCH v11 1/2] Key PGSTAT_KIND_RELATION by relfile locator This patch changes the key used for the PGSTAT_KIND_RELATION statistic kind. Instead of the relation oid, it now relies on: - dboid (linked to RelFileLocator's dbOid) - objoid which is the result of a new macro (namely RelFileLocatorToPgStatObjid()) that computes an objoid based on the RelFileLocator's spcOid and the RelFileLocator's relNumber. This is possible as, since b14e9ce7d55c, the objoid is now uint64 and spcOid and relNumber are 32 bits. That will allow us to add new stats (add writes counters) and ensure that some counters (n_dead_tup and friends) are replicated. The patch introduces pgstat_reloid_to_relfilelocator() to 1) avoid calling RelationIdGetRelation() to get the relfilelocator based on the relation oid and 2) handle the partitioned table case. Please note that: - when running pg_stat_have_stats('relation',...) we now need to be connected to the database that hosts the relation. As pg_stat_have_stats() is not documented publicly, then the changes done in 029_stats_restart.pl look enough. - this patch does not handle rewrites so some tests are failing. It's only intent is to ease the review and should not be pushed without being merged with the following patch that handles the rewrites. - it can be used to test that stats are incremented correctly and that we're able to retrieve them as long as rewrites are not involved. 
--- src/backend/postmaster/autovacuum.c | 17 +- src/backend/utils/activity/pgstat_relation.c | 236 ++++++++++++++++--- src/backend/utils/adt/pgstatfuncs.c | 22 +- src/include/catalog/pg_tablespace.dat | 4 + src/include/catalog/pg_tablespace.h | 8 + src/include/pgstat.h | 15 +- src/include/utils/pgstat_internal.h | 1 + src/test/recovery/t/029_stats_restart.pl | 40 ++-- 8 files changed, 271 insertions(+), 72 deletions(-) 6.1% src/backend/postmaster/ 61.6% src/backend/utils/activity/ 5.1% src/backend/utils/adt/ 3.2% src/include/catalog/ 5.6% src/include/ 18.1% src/test/recovery/t/ diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c index 6fde740465f..4bd5a6824bd 100644 --- a/src/backend/postmaster/autovacuum.c +++ b/src/backend/postmaster/autovacuum.c @@ -1990,12 +1990,16 @@ do_autovacuum(void) bool dovacuum; bool doanalyze; bool wraparound; + RelFileLocator locator; if (classForm->relkind != RELKIND_RELATION && classForm->relkind != RELKIND_MATVIEW) continue; relid = classForm->oid; + locator.dbOid = classForm->relisshared ? InvalidOid : MyDatabaseId; + locator.spcOid = classForm->reltablespace; + locator.relNumber = classForm->relfilenode; /* * Check if it is a temp table (presumably, of some other backend's). @@ -2024,8 +2028,7 @@ do_autovacuum(void) /* Fetch reloptions and the pgstat entry for this table */ relopts = extract_autovac_opts(tuple, pg_class_desc); - tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared, - relid); + tabentry = pgstat_fetch_stat_tabentry_by_locator(locator); /* Check if it needs vacuum or analyze */ relation_needs_vacanalyze(relid, relopts, classForm, tabentry, @@ -2090,6 +2093,7 @@ do_autovacuum(void) bool dovacuum; bool doanalyze; bool wraparound; + RelFileLocator locator; /* * We cannot safely process other backends' temp tables, so skip 'em. @@ -2098,6 +2102,9 @@ do_autovacuum(void) continue; relid = classForm->oid; + locator.dbOid = classForm->relisshared ? 
InvalidOid : MyDatabaseId; + locator.spcOid = classForm->reltablespace; + locator.relNumber = classForm->relfilenode; /* * fetch reloptions -- if this toast table does not have them, try the @@ -2117,8 +2124,7 @@ do_autovacuum(void) } /* Fetch the pgstat entry for this table */ - tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared, - relid); + tabentry = pgstat_fetch_stat_tabentry_by_locator(locator); relation_needs_vacanalyze(relid, relopts, classForm, tabentry, effective_multixact_freeze_max_age, @@ -2915,8 +2921,7 @@ recheck_relation_needs_vacanalyze(Oid relid, PgStat_StatTabEntry *tabentry; /* fetch the pgstat table entry */ - tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared, - relid); + tabentry = pgstat_fetch_stat_tabentry_ext(relid); relation_needs_vacanalyze(relid, avopts, classForm, tabentry, effective_multixact_freeze_max_age, diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c index bc8c43b96aa..89bf0cbed56 100644 --- a/src/backend/utils/activity/pgstat_relation.c +++ b/src/backend/utils/activity/pgstat_relation.c @@ -17,12 +17,17 @@ #include "postgres.h" +#include "access/htup_details.h" #include "access/twophase_rmgr.h" #include "access/xact.h" #include "catalog/catalog.h" +#include "catalog/pg_tablespace.h" +#include "storage/lmgr.h" #include "utils/memutils.h" #include "utils/pgstat_internal.h" #include "utils/rel.h" +#include "utils/relmapper.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -36,13 +41,12 @@ typedef struct TwoPhasePgStatRecord PgStat_Counter inserted_pre_truncdrop; PgStat_Counter updated_pre_truncdrop; PgStat_Counter deleted_pre_truncdrop; - Oid id; /* table's OID */ - bool shared; /* is it a shared catalog? */ + RelFileLocator locator; /* table's rd_locator */ bool truncdropped; /* was the relation truncated/dropped? 
*/ } TwoPhasePgStatRecord; -static PgStat_TableStatus *pgstat_prep_relation_pending(Oid rel_id, bool isshared); +static PgStat_TableStatus *pgstat_prep_relation_pending(RelFileLocator locator); static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level); static void ensure_tabstat_xact_level(PgStat_TableStatus *pgstat_info); static void save_truncdrop_counters(PgStat_TableXactStatus *trans, bool is_drop); @@ -60,8 +64,7 @@ pgstat_copy_relation_stats(Relation dst, Relation src) PgStatShared_Relation *dstshstats; PgStat_EntryRef *dst_ref; - srcstats = pgstat_fetch_stat_tabentry_ext(src->rd_rel->relisshared, - RelationGetRelid(src)); + srcstats = pgstat_fetch_stat_tabentry_ext(RelationGetRelid(src)); if (!srcstats) return; @@ -94,8 +97,10 @@ pgstat_init_relation(Relation rel) /* * We only count stats for relations with storage and partitioned tables + * and we don't count stats generated during a rewrite. */ - if (!RELKIND_HAS_STORAGE(relkind) && relkind != RELKIND_PARTITIONED_TABLE) + if ((!RELKIND_HAS_STORAGE(relkind) && relkind != RELKIND_PARTITIONED_TABLE) || + OidIsValid(rel->rd_rel->relrewrite)) { rel->pgstat_enabled = false; rel->pgstat_info = NULL; @@ -130,12 +135,37 @@ pgstat_init_relation(Relation rel) void pgstat_assoc_relation(Relation rel) { + RelFileLocator locator; + Assert(rel->pgstat_enabled); Assert(rel->pgstat_info == NULL); + /* + * Don't associate stats for relations without storage and non partitioned + * tables. + */ + if (!RELKIND_HAS_STORAGE(rel->rd_rel->relkind) && + rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) + return; + + if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) + locator = rel->rd_locator; + else + { + /* + * Partitioned tables don't have storage, so construct a synthetic + * locator for statistics tracking. Use a reserved pseudo tablespace + * OID that cannot conflict with real tablespaces, and the relation + * OID as relNumber. 
This ensures no collision with regular relations + * even after OID wraparound. + */ + locator.dbOid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId); + locator.spcOid = PSEUDO_PARTITION_TABLE_SPCOID; + locator.relNumber = rel->rd_id; + } + /* Else find or make the PgStat_TableStatus entry, and update link */ - rel->pgstat_info = pgstat_prep_relation_pending(RelationGetRelid(rel), - rel->rd_rel->relisshared); + rel->pgstat_info = pgstat_prep_relation_pending(locator); /* don't allow link a stats to multiple relcache entries */ Assert(rel->pgstat_info->relation == NULL); @@ -167,9 +197,13 @@ pgstat_unlink_relation(Relation rel) void pgstat_create_relation(Relation rel) { + /* don't track stats for relations without storage */ + if (!RELKIND_HAS_STORAGE(rel->rd_rel->relkind)) + return; + pgstat_create_transactional(PGSTAT_KIND_RELATION, - rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId, - RelationGetRelid(rel)); + rel->rd_locator.dbOid, + RelFileLocatorToPgStatObjid(rel->rd_locator)); } /* @@ -181,9 +215,13 @@ pgstat_drop_relation(Relation rel) int nest_level = GetCurrentTransactionNestLevel(); PgStat_TableStatus *pgstat_info; + /* don't track stats for relations without storage */ + if (!RELKIND_HAS_STORAGE(rel->rd_rel->relkind)) + return; + pgstat_drop_transactional(PGSTAT_KIND_RELATION, - rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId, - RelationGetRelid(rel)); + rel->rd_locator.dbOid, + RelFileLocatorToPgStatObjid(rel->rd_locator)); if (!pgstat_should_count_relation(rel)) return; @@ -213,20 +251,23 @@ pgstat_report_vacuum(Relation rel, PgStat_Counter livetuples, PgStat_EntryRef *entry_ref; PgStatShared_Relation *shtabentry; PgStat_StatTabEntry *tabentry; - Oid dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId); TimestampTz ts; PgStat_Counter elapsedtime; + RelFileLocator locator; if (!pgstat_track_counts) return; + locator = rel->rd_locator; + /* Store the data in the table's hash table entry. 
*/ ts = GetCurrentTimestamp(); elapsedtime = TimestampDifferenceMilliseconds(starttime, ts); /* block acquiring lock for the same reason as pgstat_report_autovac() */ - entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION, dboid, - RelationGetRelid(rel), false); + entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION, locator.dbOid, + RelFileLocatorToPgStatObjid(locator), + false); shtabentry = (PgStatShared_Relation *) entry_ref->shared_stats; tabentry = &shtabentry->stats; @@ -285,9 +326,9 @@ pgstat_report_analyze(Relation rel, PgStat_EntryRef *entry_ref; PgStatShared_Relation *shtabentry; PgStat_StatTabEntry *tabentry; - Oid dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId); TimestampTz ts; PgStat_Counter elapsedtime; + RelFileLocator locator; if (!pgstat_track_counts) return; @@ -325,9 +366,25 @@ pgstat_report_analyze(Relation rel, ts = GetCurrentTimestamp(); elapsedtime = TimestampDifferenceMilliseconds(starttime, ts); + if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) + locator = rel->rd_locator; + else + { + /* + * Partitioned tables don't have storage, so construct a synthetic + * locator for statistics tracking. Use a reserved pseudo tablespace + * OID that cannot conflict with real tablespaces, and the relation + * OID as relNumber. This ensures no collision with regular relations + * even after OID wraparound. + */ + locator.dbOid = (rel->rd_rel->relisshared ? 
InvalidOid : MyDatabaseId); + locator.spcOid = PSEUDO_PARTITION_TABLE_SPCOID; + locator.relNumber = rel->rd_id; + } /* block acquiring lock for the same reason as pgstat_report_autovac() */ - entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION, dboid, - RelationGetRelid(rel), + entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION, + locator.dbOid, + RelFileLocatorToPgStatObjid(locator), false); /* can't get dropped while accessed */ Assert(entry_ref != NULL && entry_ref->shared_stats != NULL); @@ -468,7 +525,16 @@ pgstat_update_heap_dead_tuples(Relation rel, int delta) PgStat_StatTabEntry * pgstat_fetch_stat_tabentry(Oid relid) { - return pgstat_fetch_stat_tabentry_ext(IsSharedRelation(relid), relid); + return pgstat_fetch_stat_tabentry_ext(relid); +} + +PgStat_StatTabEntry * +pgstat_fetch_stat_tabentry_by_locator(RelFileLocator locator) +{ + return (PgStat_StatTabEntry *) pgstat_fetch_entry( + PGSTAT_KIND_RELATION, + locator.dbOid, + RelFileLocatorToPgStatObjid(locator)); } /* @@ -476,12 +542,14 @@ pgstat_fetch_stat_tabentry(Oid relid) * whether the to-be-accessed table is a shared relation or not. */ PgStat_StatTabEntry * -pgstat_fetch_stat_tabentry_ext(bool shared, Oid reloid) +pgstat_fetch_stat_tabentry_ext(Oid reloid) { - Oid dboid = (shared ? 
InvalidOid : MyDatabaseId); + RelFileLocator locator; - return (PgStat_StatTabEntry *) - pgstat_fetch_entry(PGSTAT_KIND_RELATION, dboid, reloid); + if (!pgstat_reloid_to_relfilelocator(reloid, &locator)) + return NULL; + + return pgstat_fetch_stat_tabentry_by_locator(locator); } /* @@ -503,14 +571,17 @@ find_tabstat_entry(Oid rel_id) PgStat_TableXactStatus *trans; PgStat_TableStatus *tabentry = NULL; PgStat_TableStatus *tablestatus = NULL; + RelFileLocator locator; + + if (!pgstat_reloid_to_relfilelocator(rel_id, &locator)) + return NULL; + + entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, + locator.dbOid, + RelFileLocatorToPgStatObjid(locator)); - entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, MyDatabaseId, rel_id); if (!entry_ref) - { - entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, InvalidOid, rel_id); - if (!entry_ref) - return tablestatus; - } + return tablestatus; tabentry = (PgStat_TableStatus *) entry_ref->pending; tablestatus = palloc_object(PgStat_TableStatus); @@ -706,8 +777,12 @@ AtPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state) record.inserted_pre_truncdrop = trans->inserted_pre_truncdrop; record.updated_pre_truncdrop = trans->updated_pre_truncdrop; record.deleted_pre_truncdrop = trans->deleted_pre_truncdrop; - record.id = tabstat->id; - record.shared = tabstat->shared; + + if (tabstat->relation != NULL) + record.locator = tabstat->relation->rd_locator; + else + record.locator = tabstat->locator; + record.truncdropped = trans->truncdropped; RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0, @@ -750,7 +825,7 @@ pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info, PgStat_TableStatus *pgstat_info; /* Find or create a tabstat entry for the rel */ - pgstat_info = pgstat_prep_relation_pending(rec->id, rec->shared); + pgstat_info = pgstat_prep_relation_pending(rec->locator); /* Same math as in AtEOXact_PgStat, commit case */ pgstat_info->counts.tuples_inserted += rec->tuples_inserted; @@ -785,8 
+860,8 @@ pgstat_twophase_postabort(FullTransactionId fxid, uint16 info, TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata; PgStat_TableStatus *pgstat_info; - /* Find or create a tabstat entry for the rel */ - pgstat_info = pgstat_prep_relation_pending(rec->id, rec->shared); + /* Find or create a tabstat entry for the target locator */ + pgstat_info = pgstat_prep_relation_pending(rec->locator); /* Same math as in AtEOXact_PgStat, abort case */ if (rec->truncdropped) @@ -920,17 +995,21 @@ pgstat_relation_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts) * initialized if not exists. */ static PgStat_TableStatus * -pgstat_prep_relation_pending(Oid rel_id, bool isshared) +pgstat_prep_relation_pending(RelFileLocator locator) { PgStat_EntryRef *entry_ref; PgStat_TableStatus *pending; + uint64 objid; + + objid = RelFileLocatorToPgStatObjid(locator); entry_ref = pgstat_prep_pending_entry(PGSTAT_KIND_RELATION, - isshared ? InvalidOid : MyDatabaseId, - rel_id, NULL); + locator.dbOid, + objid, NULL); + pending = entry_ref->pending; - pending->id = rel_id; - pending->shared = isshared; + pending->id = objid; + pending->locator = locator; return pending; } @@ -1009,3 +1088,82 @@ restore_truncdrop_counters(PgStat_TableXactStatus *trans) trans->tuples_deleted = trans->deleted_pre_truncdrop; } } + +/* + * Convert a relation OID to its corresponding RelFileLocator for statistics + * tracking purposes. + * + * Returns true on success, false if the relation doesn't need statistics + * tracking. + * + * For partitioned tables, constructs a synthetic locator using the relation + * OID as relNumber, since they don't have storage. 
+ */ +bool +pgstat_reloid_to_relfilelocator(Oid reloid, RelFileLocator *locator) +{ + HeapTuple tuple; + Form_pg_class relform; + bool result = true; + + /* get the relation's tuple from pg_class */ + tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(reloid)); + + if (!HeapTupleIsValid(tuple)) + return false; + + relform = (Form_pg_class) GETSTRUCT(tuple); + + /* skip relations without storage and non partitioned tables */ + if (!RELKIND_HAS_STORAGE(relform->relkind) && + relform->relkind != RELKIND_PARTITIONED_TABLE) + { + ReleaseSysCache(tuple); + return false; + } + + if (relform->relkind != RELKIND_PARTITIONED_TABLE) + { + /* build the RelFileLocator */ + locator->relNumber = relform->relfilenode; + locator->spcOid = relform->reltablespace; + + /* handle default tablespace */ + if (!OidIsValid(locator->spcOid)) + locator->spcOid = MyDatabaseTableSpace; + + /* handle dbOid for global vs local relations */ + if (locator->spcOid == GLOBALTABLESPACE_OID) + locator->dbOid = InvalidOid; + else + locator->dbOid = MyDatabaseId; + + /* handle mapped relations */ + if (!RelFileNumberIsValid(locator->relNumber)) + { + locator->relNumber = RelationMapOidToFilenumber(reloid, + relform->relisshared); + if (!RelFileNumberIsValid(locator->relNumber)) + { + ReleaseSysCache(tuple); + return false; + } + } + } + else + { + /* + * Partitioned tables don't have storage, so construct a synthetic + * locator for statistics tracking. Use a reserved pseudo tablespace + * OID that cannot conflict with real tablespaces, and the relation + * OID as relNumber. This ensures no collision with regular relations + * even after OID wraparound. + */ + locator->dbOid = (relform->relisshared ? 
InvalidOid : MyDatabaseId); + locator->spcOid = PSEUDO_PARTITION_TABLE_SPCOID; + locator->relNumber = relform->oid; + } + + ReleaseSysCache(tuple); + return result; +} diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index b1df96e7b0b..a5a00d6f018 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -23,13 +23,13 @@ #include "common/ip.h" #include "funcapi.h" #include "miscadmin.h" -#include "pgstat.h" #include "postmaster/bgworker.h" #include "replication/logicallauncher.h" #include "storage/proc.h" #include "storage/procarray.h" #include "utils/acl.h" #include "utils/builtins.h" +#include "utils/pgstat_internal.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1961,9 +1961,14 @@ Datum pg_stat_reset_single_table_counters(PG_FUNCTION_ARGS) { Oid taboid = PG_GETARG_OID(0); - Oid dboid = (IsSharedRelation(taboid) ? InvalidOid : MyDatabaseId); + RelFileLocator locator; - pgstat_reset(PGSTAT_KIND_RELATION, dboid, taboid); + /* Get the stats locator from the relation OID */ + if (!pgstat_reloid_to_relfilelocator(taboid, &locator)) + PG_RETURN_VOID(); + + pgstat_reset(PGSTAT_KIND_RELATION, locator.dbOid, + RelFileLocatorToPgStatObjid(locator)); PG_RETURN_VOID(); } @@ -2317,5 +2322,16 @@ pg_stat_have_stats(PG_FUNCTION_ARGS) uint64 objid = PG_GETARG_INT64(2); PgStat_Kind kind = pgstat_get_kind_from_str(stats_type); + /* Convert relation OID to relfilenode objid */ + if (kind == PGSTAT_KIND_RELATION) + { + RelFileLocator locator; + + if (!pgstat_reloid_to_relfilelocator(objid, &locator)) + PG_RETURN_BOOL(false); + + objid = RelFileLocatorToPgStatObjid(locator); + } + PG_RETURN_BOOL(pgstat_have_entry(kind, dboid, objid)); } diff --git a/src/include/catalog/pg_tablespace.dat b/src/include/catalog/pg_tablespace.dat index c4cde415219..73ed046be31 100644 --- a/src/include/catalog/pg_tablespace.dat +++ b/src/include/catalog/pg_tablespace.dat @@ 
-10,6 +10,10 @@ # #---------------------------------------------------------------------- +/* + * When adding a new one, ensure it does not conflict with + * PSEUDO_PARTITION_TABLE_SPCOID. + */ [ { oid => '1663', oid_symbol => 'DEFAULTTABLESPACE_OID', diff --git a/src/include/catalog/pg_tablespace.h b/src/include/catalog/pg_tablespace.h index fe7a5ab538f..f511adc3965 100644 --- a/src/include/catalog/pg_tablespace.h +++ b/src/include/catalog/pg_tablespace.h @@ -21,6 +21,14 @@ #include "catalog/genbki.h" #include "catalog/pg_tablespace_d.h" /* IWYU pragma: export */ +/* + * Reserved tablespace OID for partitioned table pseudo locators. + * This is not an actual tablespace, just a reserved value to distinguish + * partitioned table statistics from regular table statistics. Ensures it does + * not conflict with the ones in pg_tablespace.dat. + */ +#define PSEUDO_PARTITION_TABLE_SPCOID 1665 + /* ---------------- * pg_tablespace definition. cpp turns this into * typedef struct FormData_pg_tablespace diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 9bb777c3d5a..e9e4d32c3b8 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -16,6 +16,7 @@ #include "portability/instr_time.h" #include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */ #include "replication/conflict.h" +#include "storage/relfilelocator.h" #include "utils/backend_progress.h" /* for backward compatibility */ /* IWYU pragma: export */ #include "utils/backend_status.h" /* for backward compatibility */ /* IWYU pragma: export */ #include "utils/pgstat_kind.h" @@ -34,6 +35,12 @@ /* Default directory to store temporary statistics data in */ #define PG_STAT_TMP_DIR "pg_stat_tmp" +/* + * Build a pgstat key Objid based on a RelFileLocator. + */ +#define RelFileLocatorToPgStatObjid(locator) \ + (((uint64) (locator).spcOid << 32) | (locator).relNumber) + /* Values for track_functions GUC variable --- order is significant! 
*/ typedef enum TrackFunctionsLevel { @@ -174,11 +181,11 @@ typedef struct PgStat_TableCounts */ typedef struct PgStat_TableStatus { - Oid id; /* table's OID */ - bool shared; /* is it a shared catalog? */ + uint64 id; /* hash of relfilelocator for stats key */ struct PgStat_TableXactStatus *trans; /* lowest subxact's counts */ PgStat_TableCounts counts; /* event counts to be sent */ Relation relation; /* rel that is using this entry */ + RelFileLocator locator; /* table's relfilelocator */ } PgStat_TableStatus; /* ---------- @@ -734,8 +741,8 @@ extern void pgstat_twophase_postabort(FullTransactionId fxid, uint16 info, void *recdata, uint32 len); extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid); -extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_ext(bool shared, - Oid reloid); +extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_by_locator(RelFileLocator locator); +extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_ext(Oid reloid); extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id); diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h index 9b8fbae00ed..42f29496551 100644 --- a/src/include/utils/pgstat_internal.h +++ b/src/include/utils/pgstat_internal.h @@ -765,6 +765,7 @@ extern void PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state); extern bool pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait); extern void pgstat_relation_delete_pending_cb(PgStat_EntryRef *entry_ref); extern void pgstat_relation_reset_timestamp_cb(PgStatShared_Common *header, TimestampTz ts); +extern bool pgstat_reloid_to_relfilelocator(Oid reloid, RelFileLocator *locator); /* diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl index cdc427dbc78..4d00087dc6f 100644 --- a/src/test/recovery/t/029_stats_restart.pl +++ b/src/test/recovery/t/029_stats_restart.pl @@ -55,10 +55,10 @@ trigger_funcrel_stat(); # verify stats objects exist $sect = "initial"; 
-is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist"); -is(have_stats('function', $dboid, $funcoid), +is(have_stats($connect_db, 'database', $dboid, 0), 't', "$sect: db stats do exist"); +is(have_stats($db_under_test, 'function', $dboid, $funcoid), 't', "$sect: function stats do exist"); -is(have_stats('relation', $dboid, $tableoid), +is(have_stats($db_under_test, 'relation', $dboid, $tableoid), 't', "$sect: relation stats do exist"); # regular shutdown @@ -79,10 +79,10 @@ copy($og_stats, $statsfile) or die "Copy failed: $!"; $node->start; $sect = "copy"; -is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist"); -is(have_stats('function', $dboid, $funcoid), +is(have_stats($connect_db, 'database', $dboid, 0), 't', "$sect: db stats do exist"); +is(have_stats($db_under_test, 'function', $dboid, $funcoid), 't', "$sect: function stats do exist"); -is(have_stats('relation', $dboid, $tableoid), +is(have_stats($db_under_test, 'relation', $dboid, $tableoid), 't', "$sect: relation stats do exist"); $node->stop('immediate'); @@ -96,10 +96,10 @@ $node->start; # stats should have been discarded $sect = "post immediate"; -is(have_stats('database', $dboid, 0), 'f', "$sect: db stats do not exist"); -is(have_stats('function', $dboid, $funcoid), +is(have_stats($connect_db, 'database', $dboid, 0), 'f', "$sect: db stats do not exist"); +is(have_stats($db_under_test, 'function', $dboid, $funcoid), 'f', "$sect: function stats do exist"); -is(have_stats('relation', $dboid, $tableoid), +is(have_stats($db_under_test, 'relation', $dboid, $tableoid), 'f', "$sect: relation stats do not exist"); # get rid of backup statsfile @@ -110,10 +110,10 @@ unlink $statsfile or die "cannot unlink $statsfile $!"; trigger_funcrel_stat(); $sect = "post immediate, new"; -is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist"); -is(have_stats('function', $dboid, $funcoid), +is(have_stats($connect_db, 'database', $dboid, 0), 't', "$sect: db stats do exist"); 
+is(have_stats($db_under_test, 'function', $dboid, $funcoid), 't', "$sect: function stats do exist"); -is(have_stats('relation', $dboid, $tableoid), +is(have_stats($db_under_test, 'relation', $dboid, $tableoid), 't', "$sect: relation stats do exist"); # regular shutdown @@ -129,10 +129,10 @@ $node->start; # no stats present due to invalid stats file $sect = "invalid_overwrite"; -is(have_stats('database', $dboid, 0), 'f', "$sect: db stats do not exist"); -is(have_stats('function', $dboid, $funcoid), +is(have_stats($connect_db, 'database', $dboid, 0), 'f', "$sect: db stats do not exist"); +is(have_stats($db_under_test, 'function', $dboid, $funcoid), 'f', "$sect: function stats do not exist"); -is(have_stats('relation', $dboid, $tableoid), +is(have_stats($db_under_test, 'relation', $dboid, $tableoid), 'f', "$sect: relation stats do not exist"); @@ -145,10 +145,10 @@ append_file($og_stats, "XYZ"); $node->start; $sect = "invalid_append"; -is(have_stats('database', $dboid, 0), 'f', "$sect: db stats do not exist"); -is(have_stats('function', $dboid, $funcoid), +is(have_stats($connect_db, 'database', $dboid, 0), 'f', "$sect: db stats do not exist"); +is(have_stats($db_under_test, 'function', $dboid, $funcoid), 'f', "$sect: function stats do not exist"); -is(have_stats('relation', $dboid, $tableoid), +is(have_stats($db_under_test, 'relation', $dboid, $tableoid), 'f', "$sect: relation stats do not exist"); @@ -307,9 +307,9 @@ sub trigger_funcrel_stat sub have_stats { - my ($kind, $dboid, $objid) = @_; + my ($db, $kind, $dboid, $objid) = @_; - return $node->safe_psql($connect_db, + return $node->safe_psql($db, "SELECT pg_stat_have_stats('$kind', $dboid, $objid)"); } -- 2.34.1 [text/x-diff] v11-0002-handle-relation-statistics-correctly-during-rewr.patch (25.9K, 3-v11-0002-handle-relation-statistics-correctly-during-rewr.patch) download | inline diff: From 7a5ad002b22eb54256e7f3481f497f1cda2deca9 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Tue, 4 
Nov 2025 13:52:46 +0000 Subject: [PATCH v11 2/2] handle relation statistics correctly during rewrites Now that PGSTAT_KIND_RELATION is keyed by relfilenode, we need to handle rewrites. To do so, this patch: - Adds PgStat_PendingRewrite, a new struct to track rewrite operations within a transaction, storing the old locator, new locator, and original locator (for rewrite chains). This allows stats to be copied from the original location to the final location at commit time. - Adds a new function, pgstat_mark_rewrite(), called when a table rewrite begins. It records the rewrite operation in a local list and detects rewrite chains by checking if the old_locator matches any existing new_locator, preserving the chain's original_locator. - Modifies pgstat_copy_relation_stats() to accept RelFileLocators instead of Relations, with a new increment parameter to accumulate stats (needed for rewrite chains with DML between rewrites). - Ensures that AtEOXact_PgStat_Relations(), AtPrepare_PgStat_Relations(), pgstat_twophase_postcommit()/postabort() and pgstat_drop_relation() handle the PgStat_PendingRewrite list correctly. Note that due to the new flush call in pgstat_twophase_postcommit() we cannot call GetCurrentTransactionStopTimestamp() in pgstat_relation_flush_cb(). So, add a check to handle this special case and call GetCurrentTimestamp() instead. Note that we'd call GetCurrentTimestamp() only if there is a rewrite, so the extra cost of GetCurrentTimestamp() should be negligible. Another solution could be to trigger the flush from FinishPreparedTransaction() but that's not worth the extra complexity. The new pending_rewrites list is traversed in multiple places. In typical usage the list should contain only a few entries, so the traversal cost should be negligible, all the more so in comparison to a rewrite.
--- src/backend/catalog/index.c | 2 +- src/backend/commands/cluster.c | 5 + src/backend/commands/tablecmds.c | 6 + src/backend/utils/activity/pgstat_relation.c | 391 ++++++++++++++++++- src/backend/utils/activity/pgstat_xact.c | 25 +- src/backend/utils/cache/relcache.c | 6 + src/include/pgstat.h | 5 +- src/tools/pgindent/typedefs.list | 1 + 8 files changed, 424 insertions(+), 17 deletions(-) 92.8% src/backend/utils/activity/ 4.9% src/backend/ diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 43de42ce39e..88ef00609db 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -1793,7 +1793,7 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName) changeDependenciesOn(RelationRelationId, oldIndexId, newIndexId); /* copy over statistics from old to new index */ - pgstat_copy_relation_stats(newClassRel, oldClassRel); + pgstat_copy_relation_stats(newClassRel->rd_locator, oldClassRel->rd_locator, false); /* Copy data of pg_statistic from the old index to the new one */ CopyStatistics(oldIndexId, newIndexId); diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index 60a4617a585..3a446f32921 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -1196,6 +1196,11 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, rel1 = relation_open(r1, NoLock); rel2 = relation_open(r2, NoLock); + + /* Mark that a rewrite happened */ + if (RELKIND_HAS_STORAGE(rel1->rd_rel->relkind)) + pgstat_mark_rewrite(rel1->rd_locator, rel2->rd_locator); + rel2->rd_createSubid = rel1->rd_createSubid; rel2->rd_newRelfilelocatorSubid = rel1->rd_newRelfilelocatorSubid; rel2->rd_firstRelfilelocatorSubid = rel1->rd_firstRelfilelocatorSubid; diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 
df1ba112b35..6244d04ab12 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -16896,6 +16896,7 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) Oid reltoastrelid; RelFileNumber newrelfilenumber; RelFileLocator newrlocator; + RelFileLocator oldrlocator; List *reltoastidxids = NIL; ListCell *lc; @@ -16934,6 +16935,7 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) newrlocator = rel->rd_locator; newrlocator.relNumber = newrelfilenumber; newrlocator.spcOid = newTableSpace; + oldrlocator = rel->rd_locator; /* hand off to AM to actually create new rel storage and copy the data */ if (rel->rd_rel->relkind == RELKIND_INDEX) @@ -16946,6 +16948,10 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) table_relation_copy_data(rel, &newrlocator); } + /* mark that a rewrite happened */ + if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind)) + pgstat_mark_rewrite(oldrlocator, newrlocator); + /* * Update the pg_class row. 
* diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c index 89bf0cbed56..4d929972fab 100644 --- a/src/backend/utils/activity/pgstat_relation.c +++ b/src/backend/utils/activity/pgstat_relation.c @@ -30,6 +30,19 @@ #include "utils/syscache.h" #include "utils/timestamp.h" +/* Pending rewrite operations for stats copying */ +typedef struct PgStat_PendingRewrite +{ + RelFileLocator old_locator; + RelFileLocator new_locator; + RelFileLocator original_locator; + int nest_level; /* Transaction nesting level where rewrite + * occurred */ + struct PgStat_PendingRewrite *next; +} PgStat_PendingRewrite; + +/* The pending rewrites list for current transaction */ +static PgStat_PendingRewrite *pending_rewrites = NULL; /* Record that's written to 2PC state file when pgstat state is persisted */ typedef struct TwoPhasePgStatRecord @@ -43,6 +56,8 @@ typedef struct TwoPhasePgStatRecord PgStat_Counter deleted_pre_truncdrop; RelFileLocator locator; /* table's rd_locator */ bool truncdropped; /* was the relation truncated/dropped? */ + RelFileLocator rewrite_old_locator; + int rewrite_nest_level; } TwoPhasePgStatRecord; @@ -54,27 +69,70 @@ static void restore_truncdrop_counters(PgStat_TableXactStatus *trans); /* - * Copy stats between relations. This is used for things like REINDEX + * Copy stats between RelFileLocator. This is used for things like REINDEX * CONCURRENTLY. */ void -pgstat_copy_relation_stats(Relation dst, Relation src) +pgstat_copy_relation_stats(RelFileLocator dst, RelFileLocator src, bool increment) { PgStat_StatTabEntry *srcstats; PgStatShared_Relation *dstshstats; PgStat_EntryRef *dst_ref; - srcstats = pgstat_fetch_stat_tabentry_ext(RelationGetRelid(src)); + srcstats = (PgStat_StatTabEntry *) pgstat_fetch_entry(PGSTAT_KIND_RELATION, + src.dbOid, + RelFileLocatorToPgStatObjid(src)); if (!srcstats) return; dst_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION, - dst->rd_rel->relisshared ? 
InvalidOid : MyDatabaseId, - RelationGetRelid(dst), + dst.dbOid, + RelFileLocatorToPgStatObjid(dst), false); dstshstats = (PgStatShared_Relation *) dst_ref->shared_stats; - dstshstats->stats = *srcstats; + + if (!increment) + dstshstats->stats = *srcstats; + else + { + /* Increment those statistics */ +#define RELFSTAT_ACC(fld, stats_to_add) \ + (dstshstats->stats.fld += stats_to_add->fld) + RELFSTAT_ACC(numscans, srcstats); + RELFSTAT_ACC(tuples_returned, srcstats); + RELFSTAT_ACC(tuples_fetched, srcstats); + RELFSTAT_ACC(tuples_inserted, srcstats); + RELFSTAT_ACC(tuples_updated, srcstats); + RELFSTAT_ACC(tuples_deleted, srcstats); + RELFSTAT_ACC(tuples_hot_updated, srcstats); + RELFSTAT_ACC(tuples_newpage_updated, srcstats); + RELFSTAT_ACC(live_tuples, srcstats); + RELFSTAT_ACC(dead_tuples, srcstats); + RELFSTAT_ACC(mod_since_analyze, srcstats); + RELFSTAT_ACC(ins_since_vacuum, srcstats); + RELFSTAT_ACC(blocks_fetched, srcstats); + RELFSTAT_ACC(blocks_hit, srcstats); + RELFSTAT_ACC(vacuum_count, srcstats); + RELFSTAT_ACC(autovacuum_count, srcstats); + RELFSTAT_ACC(analyze_count, srcstats); + RELFSTAT_ACC(autoanalyze_count, srcstats); + RELFSTAT_ACC(total_vacuum_time, srcstats); + RELFSTAT_ACC(total_autovacuum_time, srcstats); + RELFSTAT_ACC(total_analyze_time, srcstats); + RELFSTAT_ACC(total_autoanalyze_time, srcstats); +#undef RELFSTAT_ACC + + /* Replace those statistics */ +#define RELFSTAT_REP(fld, stats_to_rep) \ + (dstshstats->stats.fld = stats_to_rep->fld) + RELFSTAT_REP(lastscan, srcstats); + RELFSTAT_REP(last_vacuum_time, srcstats); + RELFSTAT_REP(last_autovacuum_time, srcstats); + RELFSTAT_REP(last_analyze_time, srcstats); + RELFSTAT_REP(last_autoanalyze_time, srcstats); +#undef RELFSTAT_REP + } pgstat_unlock_entry(dst_ref); } @@ -136,6 +194,7 @@ void pgstat_assoc_relation(Relation rel) { RelFileLocator locator; + PgStat_TableStatus *pgstat_info; Assert(rel->pgstat_enabled); Assert(rel->pgstat_info == NULL); @@ -164,14 +223,54 @@ 
pgstat_assoc_relation(Relation rel) locator.relNumber = rel->rd_id; } + /* + * If this relation was rewritten during the current transaction we may be + * reopening it with its new RelFileLocator. In that case, continue using + * the stats entry associated with the old locator rather than creating a + * new one. This ensures all stats from before and after the rewrite are + * tracked in a single entry which will be properly copied to the new + * locator at transaction commit. + */ + if (pending_rewrites != NULL) + { + PgStat_PendingRewrite *rewrite; + + for (rewrite = pending_rewrites; rewrite != NULL; rewrite = rewrite->next) + { + if (locator.dbOid == rewrite->new_locator.dbOid && + locator.spcOid == rewrite->new_locator.spcOid && + locator.relNumber == rewrite->new_locator.relNumber) + { + pgstat_info = pgstat_prep_relation_pending(rewrite->old_locator); + goto found_entry; + } + } + } + /* Else find or make the PgStat_TableStatus entry, and update link */ - rel->pgstat_info = pgstat_prep_relation_pending(locator); + pgstat_info = pgstat_prep_relation_pending(locator); + +found_entry: + rel->pgstat_info = pgstat_info; + + /* + * For relations stats, we key by physical file location, not by relation + * OID. This means during operations like ALTER TYPE it's possible that + * the relation OID changes but the relfilenode stays the same (no actual + * rewrite needed). Unlink the old relation first. 
+ */ + if (pgstat_info->relation != NULL && + pgstat_info->relation != rel) + { + pgstat_info->relation->pgstat_info = NULL; + pgstat_info->relation = NULL; + } /* don't allow link a stats to multiple relcache entries */ - Assert(rel->pgstat_info->relation == NULL); + Assert(pgstat_info->relation == NULL); /* mark this relation as the owner */ - rel->pgstat_info->relation = rel; + pgstat_info->relation = rel; } /* @@ -214,14 +313,37 @@ pgstat_drop_relation(Relation rel) { int nest_level = GetCurrentTransactionNestLevel(); PgStat_TableStatus *pgstat_info; + bool skip_transactional_drop = false; /* don't track stats for relations without storage */ if (!RELKIND_HAS_STORAGE(rel->rd_rel->relkind)) return; - pgstat_drop_transactional(PGSTAT_KIND_RELATION, - rel->rd_locator.dbOid, - RelFileLocatorToPgStatObjid(rel->rd_locator)); + /* Check if this drop is part of a pending rewrite */ + if (pending_rewrites != NULL) + { + PgStat_PendingRewrite *rewrite; + + for (rewrite = pending_rewrites; rewrite != NULL; rewrite = rewrite->next) + { + if (rel->rd_locator.dbOid == rewrite->old_locator.dbOid && + rel->rd_locator.spcOid == rewrite->old_locator.spcOid && + rel->rd_locator.relNumber == rewrite->old_locator.relNumber) + { + skip_transactional_drop = true; + break; + } + } + } + + /* + * If it is part of a rewrite, drop its stats later, for example in + * AtEOXact_PgStat_Relations(), so skip it here. 
+ */ + if (!skip_transactional_drop) + pgstat_drop_transactional(PGSTAT_KIND_RELATION, + rel->rd_locator.dbOid, + RelFileLocatorToPgStatObjid(rel->rd_locator)); if (!pgstat_should_count_relation(rel)) return; @@ -666,6 +788,48 @@ AtEOXact_PgStat_Relations(PgStat_SubXactStatus *xact_state, bool isCommit) } tabstat->trans = NULL; } + + /* preserve the stats in case of rewrite */ + if (isCommit && pending_rewrites != NULL) + { + PgStat_PendingRewrite *rewrite; + PgStat_PendingRewrite *prev = NULL; + PgStat_PendingRewrite *current = pending_rewrites; + PgStat_PendingRewrite *next; + + /* reverse the rewrites list to process in chronological order */ + while (current != NULL) + { + next = current->next; + current->next = prev; + prev = current; + current = next; + } + + /* now process rewrites in chronological order */ + for (rewrite = prev; rewrite != NULL; rewrite = rewrite->next) + { + PgStat_EntryRef *old_entry_ref; + + old_entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, + rewrite->old_locator.dbOid, + RelFileLocatorToPgStatObjid(rewrite->old_locator)); + + if (old_entry_ref && old_entry_ref->pending) + pgstat_relation_flush_cb(old_entry_ref, false); + + pgstat_copy_relation_stats(rewrite->new_locator, + rewrite->old_locator, true); + + /* drop old locator's stats */ + if (!pgstat_drop_entry(PGSTAT_KIND_RELATION, + rewrite->old_locator.dbOid, + RelFileLocatorToPgStatObjid(rewrite->old_locator))) + pgstat_request_entry_refs_gc(); + } + } + + pending_rewrites = NULL; } /* @@ -681,6 +845,30 @@ AtEOSubXact_PgStat_Relations(PgStat_SubXactStatus *xact_state, bool isCommit, in PgStat_TableXactStatus *trans; PgStat_TableXactStatus *next_trans; + /* + * If we don't commit then remove the associated rewrites if any, to keep + * the rewrite chain in sync with what will be eventually committed. 
+ */ + if (!isCommit) + { + PgStat_PendingRewrite **rewrite_ptr = &pending_rewrites; + + while (*rewrite_ptr != NULL) + { + if ((*rewrite_ptr)->nest_level >= nestDepth) + { + PgStat_PendingRewrite *to_remove = *rewrite_ptr; + + *rewrite_ptr = (*rewrite_ptr)->next; + pfree(to_remove); + } + else + { + rewrite_ptr = &((*rewrite_ptr)->next); + } + } + } + for (trans = xact_state->first; trans != NULL; trans = next_trans) { PgStat_TableStatus *tabstat; @@ -760,11 +948,19 @@ void AtPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state) { PgStat_TableXactStatus *trans; + PgStat_PendingRewrite *rewrite; + /* + * For each tabstat, find its matching rewrite and remove it from the + * pending rewrites list. This way, after processing all tabstats, pending + * rewrites will only contain rewrite only transactions. + */ for (trans = xact_state->first; trans != NULL; trans = trans->next) { PgStat_TableStatus *tabstat PG_USED_FOR_ASSERTS_ONLY; TwoPhasePgStatRecord record; + PgStat_PendingRewrite **rewrite_ptr; + bool found_rewrite = false; Assert(trans->nest_level == 1); Assert(trans->upper == NULL); @@ -784,10 +980,83 @@ AtPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state) record.locator = tabstat->locator; record.truncdropped = trans->truncdropped; + record.rewrite_nest_level = 0; + + /* + * Look for a matching rewrite and remove it from pending rewrites. We + * check three possible matches: + * + * The new_locator when stats have been added after the rewrite. The + * old_locator when stats have been added before the rewrite but not + * after. The original_locator when this tabstat is part of a rewrite + * chain. 
+ */ + rewrite_ptr = &pending_rewrites; + while (*rewrite_ptr != NULL) + { + rewrite = *rewrite_ptr; + + if ((record.locator.dbOid == rewrite->new_locator.dbOid && + record.locator.spcOid == rewrite->new_locator.spcOid && + record.locator.relNumber == rewrite->new_locator.relNumber) || + (tabstat->locator.dbOid == rewrite->old_locator.dbOid && + tabstat->locator.spcOid == rewrite->old_locator.spcOid && + tabstat->locator.relNumber == rewrite->old_locator.relNumber) || + (tabstat->locator.dbOid == rewrite->original_locator.dbOid && + tabstat->locator.spcOid == rewrite->original_locator.spcOid && + tabstat->locator.relNumber == rewrite->original_locator.relNumber)) + { + /* + * Found matching rewrite. Record the rewrite information and + * remove this rewrite from the list since it's now handled. + */ + record.rewrite_old_locator = rewrite->original_locator; + record.rewrite_nest_level = rewrite->nest_level; + record.locator = rewrite->new_locator; + found_rewrite = true; + + /* Remove from pending_rewrites list */ + *rewrite_ptr = rewrite->next; + pfree(rewrite); + break; + } + else + { + /* Move to next rewrite in the list */ + rewrite_ptr = &(rewrite->next); + } + } + + /* If no rewrite found, clear the rewrite fields */ + if (!found_rewrite) + { + memset(&record.rewrite_old_locator, 0, sizeof(RelFileLocator)); + } + + RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0, + &record, sizeof(TwoPhasePgStatRecord)); + } + + /* + * Now process any rewrites still pending. These are rewrite only + * transactions. We need to preserve their stats even though there's no + * tabstat entry for them. 
+ */ + for (rewrite = pending_rewrites; rewrite != NULL; rewrite = rewrite->next) + { + TwoPhasePgStatRecord record; + + memset(&record, 0, sizeof(TwoPhasePgStatRecord)); + record.locator = rewrite->new_locator; + record.rewrite_old_locator = rewrite->original_locator; + record.rewrite_nest_level = rewrite->nest_level; + record.truncdropped = false; RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0, &record, sizeof(TwoPhasePgStatRecord)); } + + pending_rewrites = NULL; } /* @@ -810,6 +1079,8 @@ PostPrepare_PgStat_Relations(PgStat_SubXactStatus *xact_state) tabstat = trans->parent; tabstat->trans = NULL; } + + pending_rewrites = NULL; } /* @@ -845,6 +1116,29 @@ pgstat_twophase_postcommit(FullTransactionId fxid, uint16 info, pgstat_info->counts.changed_tuples += rec->tuples_inserted + rec->tuples_updated + rec->tuples_deleted; + + if (rec->rewrite_nest_level > 0) + { + PgStat_EntryRef *old_entry_ref; + + /* Flush any pending stats for old locator first */ + old_entry_ref = pgstat_fetch_pending_entry(PGSTAT_KIND_RELATION, + rec->rewrite_old_locator.dbOid, + RelFileLocatorToPgStatObjid(rec->rewrite_old_locator)); + + if (old_entry_ref && old_entry_ref->pending) + pgstat_relation_flush_cb(old_entry_ref, false); + + /* Copy stats from old to new locator */ + pgstat_copy_relation_stats(rec->locator, rec->rewrite_old_locator, + true); + + /* Drop old locator's stats */ + if (!pgstat_drop_entry(PGSTAT_KIND_RELATION, + rec->rewrite_old_locator.dbOid, + RelFileLocatorToPgStatObjid(rec->rewrite_old_locator))) + pgstat_request_entry_refs_gc(); + } } /* @@ -859,9 +1153,26 @@ pgstat_twophase_postabort(FullTransactionId fxid, uint16 info, { TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata; PgStat_TableStatus *pgstat_info; + RelFileLocator target_locator; + + /* + * For aborted transactions with rewrites (like TRUNCATE), we need to + * restore stats to the old locator, not the new one. The new locator + * should be dropped since the rewrite is being rolled back. 
+	 */
+	if (rec->rewrite_nest_level > 0)
+	{
+		/* Use the old locator */
+		target_locator = rec->rewrite_old_locator;
+	}
+	else
+	{
+		/* No rewrite, use the original locator */
+		target_locator = rec->locator;
+	}
 
 	/* Find or create a tabstat entry for the target locator */
-	pgstat_info = pgstat_prep_relation_pending(rec->locator);
+	pgstat_info = pgstat_prep_relation_pending(target_locator);
 
 	/* Same math as in AtEOXact_PgStat, abort case */
 	if (rec->truncdropped)
@@ -916,7 +1227,17 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 	tabentry->numscans += lstats->counts.numscans;
 	if (lstats->counts.numscans)
 	{
-		TimestampTz t = GetCurrentTransactionStopTimestamp();
+		TimestampTz t;
+
+		/*
+		 * Checking the transaction state due to the flush call in
+		 * pgstat_twophase_postcommit() that would break the assertion on the
+		 * state in GetCurrentTransactionStopTimestamp().
+		 */
+		if (!IsTransactionState())
+			t = GetCurrentTransactionStopTimestamp();
+		else
+			t = GetCurrentTimestamp();
 
 		if (t > tabentry->lastscan)
 			tabentry->lastscan = t;
@@ -1167,3 +1488,45 @@ pgstat_reloid_to_relfilelocator(Oid reloid, RelFileLocator *locator)
 	ReleaseSysCache(tuple);
 	return result;
 }
+
+/*
+ * Mark that a relation rewrite has occurred, preserving the original locator
+ * so stats can be copied at transaction commit.
+ */
+void
+pgstat_mark_rewrite(RelFileLocator old_locator, RelFileLocator new_locator)
+{
+	PgStat_PendingRewrite *rewrite;
+	PgStat_PendingRewrite *existing;
+	RelFileLocator original_locator = old_locator;
+
+	for (existing = pending_rewrites; existing != NULL; existing = existing->next)
+	{
+		if (old_locator.dbOid == existing->new_locator.dbOid &&
+			old_locator.spcOid == existing->new_locator.spcOid &&
+			old_locator.relNumber == existing->new_locator.relNumber)
+		{
+			original_locator = existing->original_locator;
+			break;
+		}
+	}
+
+	/* Allocate in TopTransactionContext memory context */
+	rewrite = MemoryContextAlloc(TopTransactionContext,
+								 sizeof(PgStat_PendingRewrite));
+
+	rewrite->old_locator = old_locator;
+	rewrite->new_locator = new_locator;
+	rewrite->original_locator = original_locator;
+	rewrite->nest_level = GetCurrentTransactionNestLevel();
+
+	/* Add to the list */
+	rewrite->next = pending_rewrites;
+	pending_rewrites = rewrite;
+}
+
+void
+pgstat_clear_rewrite(void)
+{
+	pending_rewrites = NULL;
+}
diff --git a/src/backend/utils/activity/pgstat_xact.c b/src/backend/utils/activity/pgstat_xact.c
index 5e2d69e6297..8ed8f5317f3 100644
--- a/src/backend/utils/activity/pgstat_xact.c
+++ b/src/backend/utils/activity/pgstat_xact.c
@@ -55,6 +55,8 @@ AtEOXact_PgStat(bool isCommit, bool parallel)
 	}
 	pgStatXactStack = NULL;
 
+	pgstat_clear_rewrite();
+
 	/* Make sure any stats snapshot is thrown away */
 	pgstat_clear_snapshot();
 }
@@ -360,8 +362,29 @@ create_drop_transactional_internal(PgStat_Kind kind, Oid dboid, uint64 objid, bo
 void
 pgstat_create_transactional(PgStat_Kind kind, Oid dboid, uint64 objid)
 {
-	if (pgstat_get_entry_ref(kind, dboid, objid, false, NULL))
+	PgStat_EntryRef *entry_ref;
+
+	entry_ref = pgstat_get_entry_ref(kind, dboid, objid, false, NULL);
+
+	if (entry_ref)
 	{
+		/*
+		 * For relations stats, we key by physical file location, not by
+		 * relation OID.
This means during operations like ALTER TYPE where
+		 * the relation OID changes but the relfilenode stays the same (no
+		 * actual rewrite needed), we'll find an existing entry.
+		 *
+		 * This is expected behavior, we want to preserve stats across the
+		 * catalog change. Simply reset and recreate the entry for the new
+		 * relation OID without warning.
+		 */
+		if (kind == PGSTAT_KIND_RELATION)
+		{
+			pgstat_reset(kind, dboid, objid);
+			create_drop_transactional_internal(kind, dboid, objid, true);
+			return;
+		}
+
 		ereport(WARNING,
 				errmsg("resetting existing statistics for kind %s, db=%u, oid=%" PRIu64,
 					   (pgstat_get_kind_info(kind))->name, dboid,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6b634c9fff1..692d084acb2 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -85,6 +85,7 @@
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/pgstat_internal.h"
 #include "utils/relmapper.h"
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
@@ -3775,6 +3776,7 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
 	MultiXactId minmulti = InvalidMultiXactId;
 	TransactionId freezeXid = InvalidTransactionId;
 	RelFileLocator newrlocator;
+	RelFileLocator oldrlocator = relation->rd_locator;
 
 	if (!IsBinaryUpgrade)
 	{
@@ -3946,6 +3948,10 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
 
 	table_close(pg_class, RowExclusiveLock);
 
+	/* Mark that a rewrite happened */
+	if (RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		pgstat_mark_rewrite(oldrlocator, newrlocator);
+
 	/*
 	 * Make the pg_class row change or relation map change visible. This will
 	 * cause the relcache entry to get updated, too.
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e9e4d32c3b8..a9a4fa9f8f2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -669,7 +669,7 @@ extern PgStat_FunctionCounts *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_create_relation(Relation rel);
 extern void pgstat_drop_relation(Relation rel);
-extern void pgstat_copy_relation_stats(Relation dst, Relation src);
+extern void pgstat_copy_relation_stats(RelFileLocator dst, RelFileLocator src, bool increment);
 
 extern void pgstat_init_relation(Relation rel);
 extern void pgstat_assoc_relation(Relation rel);
@@ -681,6 +681,9 @@ extern void pgstat_report_vacuum(Relation rel, PgStat_Counter livetuples,
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter, TimestampTz starttime);
+extern void pgstat_mark_rewrite(RelFileLocator old_locator,
+								RelFileLocator new_locator);
+extern void pgstat_clear_rewrite(void);
 
 /*
  * If stats are enabled, but pending data hasn't been prepared yet, call
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 241945734ec..05b1380c08c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2279,6 +2279,7 @@ PgStat_KindInfo
 PgStat_LocalState
 PgStat_PendingDroppedStatsItem
 PgStat_PendingIO
+PgStat_PendingRewrite
 PgStat_SLRUStats
 PgStat_ShmemControl
 PgStat_Snapshot
-- 
2.34.1
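For readers following the patch, the chain handling in pgstat_mark_rewrite() and the commit/abort choice made in the two-phase callbacks can be approximated outside the server. What follows is only a minimal standalone sketch, not the patch's code: Locator, PendingRewrite, mark_rewrite() and stats_target() are hypothetical stand-ins, malloc() replaces the TopTransactionContext allocation, and nest levels are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for PostgreSQL's RelFileLocator. */
typedef struct Locator
{
	uint32_t	spcOid;
	uint32_t	dbOid;
	uint32_t	relNumber;
} Locator;

/* Hypothetical, simplified stand-in for the patch's PgStat_PendingRewrite. */
typedef struct PendingRewrite
{
	Locator		old_locator;		/* file replaced by this rewrite */
	Locator		new_locator;		/* file created by this rewrite */
	Locator		original_locator;	/* file at the start of the chain */
	struct PendingRewrite *next;
} PendingRewrite;

static PendingRewrite *pending = NULL;

static bool
locator_equals(Locator a, Locator b)
{
	return a.spcOid == b.spcOid &&
		a.dbOid == b.dbOid &&
		a.relNumber == b.relNumber;
}

/*
 * Record old -> new. If "old" was itself produced by an earlier rewrite in
 * the same transaction, inherit that rewrite's original locator so the whole
 * chain still points back to the file that owns the stats.
 */
static void
mark_rewrite(Locator old_locator, Locator new_locator)
{
	Locator		original = old_locator;
	PendingRewrite *rw;

	for (PendingRewrite *e = pending; e != NULL; e = e->next)
	{
		if (locator_equals(old_locator, e->new_locator))
		{
			original = e->original_locator;
			break;
		}
	}

	rw = malloc(sizeof(PendingRewrite));
	if (rw == NULL)
		abort();

	rw->old_locator = old_locator;
	rw->new_locator = new_locator;
	rw->original_locator = original;
	rw->next = pending;			/* push on the front of the list */
	pending = rw;
}

/*
 * On commit the stats follow the new file; on abort they stay with the
 * original file, because the rewrite is rolled back.
 */
static Locator
stats_target(const PendingRewrite *rw, bool committed)
{
	assert(rw != NULL);
	return committed ? rw->new_locator : rw->original_locator;
}
```

With two rewrites A -> B and B -> C in one transaction, the newest list entry keeps A as its original locator, so a commit moves the stats from A to C and an abort leaves them on A.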
* Re: relfilenode statistics
@ 2026-03-09 07:43 Bertrand Drouvot <[email protected]>
  parent: Bertrand Drouvot <[email protected]>
From: Bertrand Drouvot @ 2026-03-09 07:43 UTC
To: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected]

Hi,

On Mon, Feb 23, 2026 at 06:39:03AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Tue, Jan 13, 2026 at 09:29:02AM +0000, Bertrand Drouvot wrote:
> > Hi,
> >
> > On Tue, Dec 16, 2025 at 10:39:15AM -0500, Andres Freund wrote:
> > > On 2025-12-16 16:33:17 +0900, Michael Paquier wrote:
> >
> > Andres, Michael, let me try to sum up my understanding of the current state
> > and see how we could now move forward.
> >
> > First of all, I understand that you both think that the patch outcome will be
> > useful to have. The current debate is about the design, the current status is:
> >
> > - Andres raised specific technical/implementation concerns and I've proposed
> >   solutions in [1]. It also looks like Andres supports the overall design approach.
> > - Michael is not really ok with the current design approach.
> >
> > That means, that with the current design in place, Michael would probably not
> > commit it (even after review(s)).
> >
> > Given that I'm also in favor of the current proposed design, this raises the
> > questions:
> >
> > - Andres, would you commit such a patch (after review iteration(s) of course)?
> > - Michael, if Andres is ok with the above, would you still offer your help for the
> >   review part (even if the design is not what you "prefer"/"like")?
> >
> > [1]: https://postgr.es/m/aUEyzoOJtrCLAEeT%40ip-10-97-1-34.eu-west-3.compute.internal
>
> PFA, tiny rebase due to 9842e8aca09.

PFA, a new mandatory rebase.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
* Re: relfilenode statistics
@ 2026-03-18 03:57 Bertrand Drouvot <[email protected]>
  parent: Bertrand Drouvot <[email protected]>
From: Bertrand Drouvot @ 2026-03-18 03:57 UTC
To: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected]

Hi,

On Mon, Mar 09, 2026 at 07:43:43AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Mon, Feb 23, 2026 at 06:39:03AM +0000, Bertrand Drouvot wrote:
> > Hi,
> >
> > On Tue, Jan 13, 2026 at 09:29:02AM +0000, Bertrand Drouvot wrote:
> > > Hi,
> > >
> > > On Tue, Dec 16, 2025 at 10:39:15AM -0500, Andres Freund wrote:
> > > > On 2025-12-16 16:33:17 +0900, Michael Paquier wrote:
> > >
> > > Andres, Michael, let me try to sum up my understanding of the current state
> > > and see how we could now move forward.
> > >
> > > First of all, I understand that you both think that the patch outcome will be
> > > useful to have. The current debate is about the design, the current status is:
> > >
> > > - Andres raised specific technical/implementation concerns and I've proposed
> > >   solutions in [1]. It also looks like Andres supports the overall design approach.
> > > - Michael is not really ok with the current design approach.
> > >
> > > That means, that with the current design in place, Michael would probably not
> > > commit it (even after review(s)).
> > >
> > > Given that I'm also in favor of the current proposed design, this raises the
> > > questions:
> > >
> > > - Andres, would you commit such a patch (after review iteration(s) of course)?
> > > - Michael, if Andres is ok with the above, would you still offer your help for the
> > >   review part (even if the design is not what you "prefer"/"like")?
> > >
> > > [1]: https://postgr.es/m/aUEyzoOJtrCLAEeT%40ip-10-97-1-34.eu-west-3.compute.internal
> >
> > PFA, tiny rebase due to 9842e8aca09.
>
> PFA, a new mandatory rebase.

PFA, new rebase due to fba4233c832.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
* Re: relfilenode statistics
@ 2026-03-25 03:25 Bertrand Drouvot <[email protected]>
  parent: Bertrand Drouvot <[email protected]>
From: Bertrand Drouvot @ 2026-03-25 03:25 UTC
To: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected]

Hi,

On Wed, Mar 18, 2026 at 03:57:48AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> PFA, new rebase due to fba4233c832.

Another rebase, due to 2102ebb1953 this time.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
* Re: relfilenode statistics
@ 2026-03-31 10:45 Bertrand Drouvot <[email protected]>
  parent: Bertrand Drouvot <[email protected]>
From: Bertrand Drouvot @ 2026-03-31 10:45 UTC
To: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected]

Hi,

On Wed, Mar 25, 2026 at 03:25:07AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Wed, Mar 18, 2026 at 03:57:48AM +0000, Bertrand Drouvot wrote:
> > Hi,
> >
> > PFA, new rebase due to fba4233c832.
>
> Another rebase, due to 2102ebb1953 this time.

It's more than probably too late for v19 but it needs another rebase due to
d7965d65fc5b this time.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
* Re: relfilenode statistics
@ 2026-05-18 16:28 Bertrand Drouvot <[email protected]>
  parent: Bertrand Drouvot <[email protected]>
From: Bertrand Drouvot @ 2026-05-18 16:28 UTC
To: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; +Cc: Kirill Reshke <[email protected]>; Robert Haas <[email protected]>; [email protected]

Hi,

On Tue, Mar 31, 2026 at 10:45:50AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Wed, Mar 25, 2026 at 03:25:07AM +0000, Bertrand Drouvot wrote:
> > Hi,
> >
> > On Wed, Mar 18, 2026 at 03:57:48AM +0000, Bertrand Drouvot wrote:
> > > Hi,
> > >
> > > PFA, new rebase due to fba4233c832.
> >
> > Another rebase, due to 2102ebb1953 this time.
>
> It's more than probably too late for v19 but it needs another rebase due to
> d7965d65fc5b this time.

PFA v16, a rebase due to 775fe51daae, 71ff232a5bc and c0b53ec0630.

While at it, let's sum up the current state:

Regarding Michael's question [1] about whether we should copy stats across
rewrites: I still believe we should. Not doing so would produce user-visible
regressions. The complexity is contained in patch 0002 and the approach is
tested (including 2PC, subtransaction abort, and rewrite chains).

Regarding Michael's suggestion [2] to split PgStat_StatTabEntry into three
kinds (table/index/relfilenode) from the start: I think this patch is the
right incremental step that doesn't preclude a future split. Here's my
reasoning:

As Andres pointed out [3], we'd want to populate more than just
dead_tuples/ins_since_vacuum/mod_since_analyze during recovery. The right
boundary for a split isn't clear yet until we actually implement
WAL-replay-based stat population. I think that splitting now would be a much
larger change with the risk of drawing the boundaries wrong.

The current approach (key PGSTAT_KIND_RELATION by locator, keep the unified
structure) is a contained change that unblocks future work. Once we have WAL
replay populating stats, we'll have a much better understanding of what a
split should look like, if one is still needed.

I think we should do this incremental step first, then split later if/when
the need becomes clearer.

I believe we have consensus on the core approach ("use the relfilenumber
instead of the relation OID, without changing the user experience"). The
implementation addresses all the technical concerns raised so far (no new
hash key field, PSEUDO_PARTITION_TABLE_SPCOID for partitioned tables,
pgstat_fetch_stat_tabentry_by_locator() to avoid extra syscache lookups in
do_autovacuum()).

Andres, would you be willing to drive this toward commit once we've iterated
on any remaining review feedback?

Michael, I understand this isn't the design you'd prefer. Would you be open
to reviewing the implementation nonetheless, or do you have a hard objection
that would block this path? I'm happy to address any further concerns.

[1]: https://postgr.es/m/aRGoGcOdutTHQfpn%40paquier.xyz
[2]: https://postgr.es/m/aUELPdhdcyzTM_8K%40paquier.xyz
[3]: https://postgr.es/m/zferux2jlbhqymubzhpubfrkjzhzxzguq4eprtycojtif5vbqh%402t7cu2teyqmi

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
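To make the "no new hash key field" point concrete: the patch passes a database OID plus RelFileLocatorToPgStatObjid(locator) wherever a stats entry is looked up, so the tablespace and relfilenumber have to fit in the existing 64-bit object id. The sketch below shows one way such a packing could work. The layout (tablespace OID in the high 32 bits, relfilenumber in the low 32 bits) is an assumption made for illustration, and Locator, StatsKey and both helper functions are hypothetical names rather than the patch's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for PostgreSQL's RelFileLocator. */
typedef struct Locator
{
	uint32_t	spcOid;
	uint32_t	dbOid;
	uint32_t	relNumber;
} Locator;

/* Hypothetical view of the (dboid, objid) part of a stats hash key. */
typedef struct StatsKey
{
	uint32_t	dboid;
	uint64_t	objid;
} StatsKey;

/*
 * Assumed layout: tablespace OID in the high half of objid, relfilenumber in
 * the low half. The database OID keeps its own key field.
 */
static StatsKey
locator_to_stats_key(Locator loc)
{
	StatsKey	key;

	key.dboid = loc.dbOid;
	key.objid = ((uint64_t) loc.spcOid << 32) | (uint64_t) loc.relNumber;
	return key;
}

/* Inverse of locator_to_stats_key(); no catalog lookup is involved. */
static Locator
stats_key_to_locator(StatsKey key)
{
	Locator		loc;

	loc.dbOid = key.dboid;
	loc.spcOid = (uint32_t) (key.objid >> 32);
	loc.relNumber = (uint32_t) (key.objid & UINT32_MAX);
	return loc;
}

/* Sanity check: packing followed by unpacking returns the same locator. */
static void
check_round_trip(Locator loc)
{
	Locator		back = stats_key_to_locator(locator_to_stats_key(loc));

	assert(back.spcOid == loc.spcOid);
	assert(back.dbOid == loc.dbOid);
	assert(back.relNumber == loc.relNumber);
}
```

Because the key is derived purely from the on-disk file identity, a process that cannot read pg_class (such as the startup process during crash recovery) could still compute it, which is the motivation given upthread for keying by relfilenode.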
end of thread

Thread overview: 24+ messages
2025-03-13 09:00 Re: relfilenode statistics Kirill Reshke <[email protected]>
2025-09-16 06:44 ` Michael Paquier <[email protected]>
2025-09-30 10:13 ` Bertrand Drouvot <[email protected]>
2025-09-30 23:05 ` Michael Paquier <[email protected]>
2025-10-01 14:33 ` Bertrand Drouvot <[email protected]>
2025-10-02 01:23 ` Michael Paquier <[email protected]>
2025-11-07 11:28 ` Bertrand Drouvot <[email protected]>
2025-11-08 23:33 ` Michael Paquier <[email protected]>
2025-11-10 08:53 ` Michael Paquier <[email protected]>
2025-11-12 17:03 ` Bertrand Drouvot <[email protected]>
2025-12-15 16:29 ` Bertrand Drouvot <[email protected]>
2025-12-15 17:48 ` Andres Freund <[email protected]>
2025-12-16 07:33 ` Michael Paquier <[email protected]>
2025-12-16 10:24 ` Bertrand Drouvot <[email protected]>
2025-12-16 15:39 ` Andres Freund <[email protected]>
2026-01-13 09:29 ` Bertrand Drouvot <[email protected]>
2026-02-23 06:39 ` Bertrand Drouvot <[email protected]>
2026-03-09 07:43 ` Bertrand Drouvot <[email protected]>
2026-03-18 03:57 ` Bertrand Drouvot <[email protected]>
2026-03-25 03:25 ` Bertrand Drouvot <[email protected]>
2026-03-31 10:45 ` Bertrand Drouvot <[email protected]>
2026-05-18 16:28 ` Bertrand Drouvot <[email protected]>
2025-12-16 10:22 ` Bertrand Drouvot <[email protected]>
2025-12-17 07:30 ` Bertrand Drouvot <[email protected]>