Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8i5w-000nor-2I for pgsql-hackers@arkaria.postgresql.org; Fri, 03 Apr 2026 17:14:09 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w8i5u-00CjZj-0j for pgsql-hackers@arkaria.postgresql.org; Fri, 03 Apr 2026 17:14:06 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8i5t-00CjZa-2u for pgsql-hackers@lists.postgresql.org; Fri, 03 Apr 2026 17:14:06 +0000 Received: from mail-ej1-x62a.google.com ([2a00:1450:4864:20::62a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w8i5r-00000000PMc-2gNl for pgsql-hackers@lists.postgresql.org; Fri, 03 Apr 2026 17:14:06 +0000 Received: by mail-ej1-x62a.google.com with SMTP id a640c23a62f3a-b9358bc9c50so295641166b.1 for ; Fri, 03 Apr 2026 10:14:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1775236442; cv=none; d=google.com; s=arc-20240605; b=jwmE4KnzgX5MBFAEckTOSZqMoZwAkzUrMR48PWEo0eL3SMI+DYyL5cuI9mJyo+iNK8 +fB/ksupOukVlvonqM7kmwsSin/xNgLDV7SB3Gfk00pmM+o5MqpimIyNeJsEVZEVY9ku 8UZg4tXaTy3XDmBG9zxynOVVDbVFwEVBDUuZeoseV0rvkHC04q1YTXmEutyHKjSVcxC5 jOZZ/MZniGlRFlEJCEwYhiWnIAPqSadEol0YyqCyPdAFAr3zXxzNgP4Rq98PEpWKvWpY NOjwehKQssg5eOKCQWnd850SVH5aZcz9bEhitG9tfDQKpNEWEn2GhGVNO68Zqh6/ywvX KwRw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=IoK3gPuZu/RDo5WZ+1gwJntvwjryz9kqJR4O7KBgtNo=; fh=W41B8Kw8+14ky1WGffBYkepzVuaCsScze9eekpMWBGE=; b=iAYS+6/GuJTS3gOStS6saoaJ11Zb5yUVJglSdp4avRj6fJu4dzyoWA7A3FuulZkS0j GQH+0YKl6UoAlQuIUUnYcdrLC+VkXftwVMz0y6SS/3loAxHLyZDDkU2E0l0+xXxzvY+O OGhCUCCSTJpGN2VxZ3u2eKGfaBC7RWhcbmBKW6ZO9A67c2uFlDU2AQ4vU/7TTTL++AYe w+eRbS3vX9M3QTCYQr8iVRgNAMpFyCnZAzh0yKypdMs1Rd8YgXV3Q6WdMLSd2/25SVjo TaxnRlJAuLuOAOKp/QIca5tThnb90IRmzNSldGQUlGVKgWEnRO34ysQ850WXgs9w6Qpq 8AQA==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1775236442; x=1775841242; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=IoK3gPuZu/RDo5WZ+1gwJntvwjryz9kqJR4O7KBgtNo=; b=AxB5USlz3x6uxqyt8QtP8Jyl837llL3K9JbqBDts2dNmfteoLI6ddCVQoQDFYlRBUV EAH6I2aZA7ALGTG2GuRe1UWXhj3E+9JOcpo6ZA8ftlxX6X6MkktpnI176eoJ0K9vITKF itXHFsjrFE+3hbq+nDIuXA9k/JtZm4rm51qixTZvcUYZnOoxo50iLnEfduFJy/oK4eyr S5V3Zk3yvgi97QuBU2buN5Yn25GkC06Y1GowsQGUUE5zW8eOU5wHqSV6TMUBJVlY6XlX WRLaN/eYUHgqG0hnXrKsZ2Ur/XR6XUohX0HVvlJYhDPu1/KjPGEcVDyQzRkgWgT71cMe nV2Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1775236442; x=1775841242; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=IoK3gPuZu/RDo5WZ+1gwJntvwjryz9kqJR4O7KBgtNo=; b=FYy3VlRCJ4VMg3i/4NDViqOKXGwJyHNsn3mRIqduKt+abYE/MvWtZKs5HR47toCVhE NsT/f571KQkdaZrr0DCGzlhiKpICG5BdWuffm+bQlqcW1ZdrUpCKgM3PK/CCqMZYIa50 6TL692fXDzUYBn18irUrBbbhLRvFabnI6vHdraFJvedA1tvrNqq4vrQr8j6DQ5y5hGxC wtKGqfFdKprQTXAUk0vFNK1DI7UGv44T6Tw3ogTpGHnWQqqhQ4YP0lV6slUs8JvInQXU hjlZ3/UIkqoL6o6cN7er0GzCWctGgoo0/Y4nOE7MfHdQqMmrJ5fhgJCcFRY2CuzUACBm b0eg== X-Forwarded-Encrypted: i=1; AJvYcCX60ExOjv2gE/f8nGbZeIyen26YzukrI0GE0Re/Su0ywtX0+27ggaL525v0Vb5g6JzOmIbQeVLix+KGCLGV@lists.postgresql.org X-Gm-Message-State: AOJu0YzQlm40h1f/tgSraZ7k76RaqRRgBjHfeRZrppZRDEF3HeNdfHfY VLzXVvVwvxU4KfSYoNINJCiUFwm2saH3mL76dNCxnQk+Klb8ecg2eQfLIvkscMLwTwybY+/NXJd HssqXIfIyQv2BxXSYHngZnbVb1Ufty74= X-Gm-Gg: AeBDieu2psJT+SQHSGS8/yE2/fPiiNjqbc5zvZb2V2la2Hf3Ld0GoIU8U84exCdLVLX JtnQMabkNFt/XuFvAY4p8K6EqVdYT1Ve1vpFl5dEg7TR8wr7aWdmAaYSpWJ8YDYdAA7Dw+bAgF/ aTKdcr7OLs5o/Nn/sHwWcq6h4LmKVTQQqowROJDF2rcEk0+AYtDoYIakr1DXLjGp5FiafBGOWLf zpPkH8pagaNxZK25jC7QvQURmVt+mszxmbCXvS/sDuQp7I61bRX7FfVCxRWO9wXTnB4CY/Wwp+D I256EuIDEtc6dg35MeJat+yl9N0pfs2POA3CQyfNtKmgMEY1cA== X-Received: by 2002:a17:906:ef06:b0:b9c:2c55:3384 with SMTP id a640c23a62f3a-b9c677fbd25mr176986066b.31.1775236441192; Fri, 03 Apr 2026 10:14:01 -0700 (PDT) MIME-Version: 1.0 References: <1299934.1773938807@sss.pgh.pa.us> <3683430.1775173413@sss.pgh.pa.us> In-Reply-To: From: Robert Haas Date: Fri, 3 Apr 2026 13:13:47 -0400 X-Gm-Features: AQROBzCtqV1jVCTXbv3P8qGoQUKHLRZHGZIWisUYNBrTK4ncO0Vj7euPBxnm55Y Message-ID: Subject: Re: pg_plan_advice To: Tom Lane Cc: Alexander Lakhin , Lukas Fittl , PostgreSQL Hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, Apr 2, 2026 at 10:08=E2=80=AFPM Robert Haas = wrote: > I'm not sure, either, and I agree that we have a problem. I'll give it > some more thought tomorrow. OK, here are my thoughts. I don't believe it's viable to change pg_plan_advice in such a way that it won't run into trouble in cases like this. Somebody could argue that the choice of INDEX_SCAN(table_name index_name) was bad design, and that I should have done something like INDEX_SCAN(table_name indexed_columns) instead, and that might be true. There might also be an argument that we should have both things with different spellings, and that might also very well be true. But we don't really know that changing that design decision would fully stabilize test_plan_advice. The regression tests can do anything they like, as long as they reliably pass. It now seems optimistic to me to suppose that an index with a different name is the only current or future issue we'll ever have. I mean, if the table were small enough not to care about whether an index scan or a sequential scan is used, you could concurrently drop the one and only index altogether, and what's test_plan_advice supposed to do about that? So, I argue that there are three possible categories of solutions here: (1) don't let the problem cases happen in the first place, (2) detect that a problem has happened, or (3) give up on test_plan_advice. In category (1), the simplest idea would be (1a) to run the tests serially. That would probably involve running them much less often, like in one of the CI builds but not all of them or something like that. Another idea that I had is to (1b) try to take stronger locks on the relations involved to prevent concurrent DDL on them, like a ShareUpdateExclusiveLock, or (1c) some kind of bespoke interlock specific to test_plan_advice. I think that might cause random breakage of other types, though. Another idea in this category is to try to make the main regression tests "pg_plan_advice clean". I know Tom already expressed opposition to that idea, but here me out: we could (1d) have a separate test suite that still does stuff like this, so we don't lose test coverage, and move some stuff there. Or, instead of completely separating it, we could (1e) have two schedule files, one of which includes all the tests and a second of which includes only the tests that are test_plan_advice-clean. Although my theory that the main regression tests couldn't have multiple different sessions simultaneously doing DDL on the same objects has been proven wrong, I'd still be willing to bet that it's a minority position. Of course, as Tom pointed out, there could be a "long tail" of failures here, but maybe we could create some throwaway infrastructure to help figure that out. For example if we're mostly worried about tables, we could have each backend accumulate a list of table OIDs that it touched and spit that out into the log file when it exits. That wouldn't be committable code, but it would be enough to let us run the regression tests with that once and see what overlaps exist. I bet there's very low risk of newly-added tests adding more such cases: the ones that we have are probably ancient. Of course, maybe I'm wrong about that, too, but it's a theory that we can discuss. In category (2), what if, (2a) whenever we see advice feedback that we'd otherwise print, we try replanning the query a THIRD time without any supplied advice? If we generate different advice than we did the first time we planned it, then we know for sure that something is unstable, and we can decide not to complain about whatever went wrong. This isn't completely guaranteed to work, though: what if concurrent DDL changes something between planning cycle 1 and planning cycle 2 and then changes it back before planning cycle 3? But maybe it would be acceptable to make a rule that the main regression test case shouldn't do that, and adjust cases that currently do to work otherwise. If we're not willing to make any rules at all to prevent the main regression test suite from sabotaging test_plan_advice, then it's probably doomed. And, I think there's a reasonable argument that insisting that the main regression test suite absolutely has to change the definition of an object in a way that test_plan_advice will care about and then change it back to exactly the initial state while in a concurrent session some other backend is running queries against that object is tantamount to legislating deliberate sabotage. But that said, this proposal has some other imperfections as well. In particular, a bug that caused the third planning cycle to always produce different results than the first would hide all future problems that test_plan_advice might have caught, which is pretty sad. Another variant of the same basic idea is to (2b) just detect when we've seen any shared invalidations between the start of the first planning cycle and the end of the second, and go "never mind, don't complain even if we saw a problem". The problem with this idea is that, as in the previous proposal, it might make the tests too insensitive to real issues. But I wonder if this might be fixable. Maybe we could (2c) make test_plan_advice take planner_hook and wrap a loop around the problem: it just keeps replanning the query via standard_planner (which would eventually reach test_plan_advice_advisor) until no sinval messages are absorbed between the start and end of planner, which I think we could detect using SharedInvalidMessageCounter, or until some retry limit is exhausted and we error out. I'd need to try this and see how well it works out in practice, and how often the retry is actually hit, but it seems like it might be somewhat viable. In category (3), the most blunt option is obviously just (3a) throw test_plan_advice away, which I think is probably dooming pg_plan_advice to getting silently broken in the future. I don't really have any other ideas in this category except for (1a) already mentioned, which is sort of a hybrid solution. My current thought is to do some research into (1e) and (2c). Specifically, for (1e), I want to try to figure out if this is the only case of this type or if there are lots of others, since that seems likely to have a pretty large bearing on what is realistic here. And for (2c), I think I just want to try it out and see if it seems at all feasible. Probably obviously, this is not going to happen before next week, but I hope that the frequency of buildfarm failures is now low enough that this isn't a critical issue. If that's wrong, let me know, but from my point of view, even if we eventually chose (3a), having a good a sense as possible of what the potential failure modes are here would help to design the next solution, and AFAIK this is the first failure we've seen since the DO_NOT_SCAN stuff went in. (In fact, I had a little bit of trouble finding this in the BF results even knowing it was there: filtering by test_plan_advice failures doesn't find anything recent. sifaka's failure shows up as TestModulesCheck-en_US.UTF-8, but frustratingly, the names for the stage logs don't seem to quite match the name of what failed. There is testmodules-install-check-C and testmodules-install-check-en_US.UTF-8, but those have "install" in the name and are punctuated differently, so it's not instantly clear that it's the same thing. Anyway, I do see it in there now, but what I'm saying is that if there have been other failures that are related to this, it's possible I have missed them due to stuff like this, so it's helpful that you (Tom) pointed this one out.) Tom, would welcome your thoughts, if you have any, and anyone else's thoughts as well. If none, I'll proceed as described above and update when I know more. Thanks, --=20 Robert Haas EDB: http://www.enterprisedb.com