Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tONmG-005xRe-5Y for pgsql-general@arkaria.postgresql.org; Thu, 19 Dec 2024 21:09:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tONmE-00HVCZ-GK for pgsql-general@arkaria.postgresql.org; Thu, 19 Dec 2024 21:09:46 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tONmE-00HVAO-4M for pgsql-general@lists.postgresql.org; Thu, 19 Dec 2024 21:09:45 +0000 Received: from sss.pgh.pa.us ([68.162.161.243]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tONmB-000RNu-1B for pgsql-general@postgresql.org; Thu, 19 Dec 2024 21:09:44 +0000 Received: from sss1.sss.pgh.pa.us (localhost [127.0.0.1]) by sss.pgh.pa.us (8.15.2/8.15.2) with ESMTP id 4BJL9cKi436271; Thu, 19 Dec 2024 16:09:38 -0500 From: Tom Lane To: Adrian Klaver cc: Renzo Dani , pgsql-general Subject: Re: Issue with pg_dump due to Schema OID Error In-reply-to: <9560200b-9de3-441b-b381-90c2fc4e78e3@aklaver.com> References: <95d1dc90-42f0-4026-aec1-8e21c5f2a205@aklaver.com> <9560200b-9de3-441b-b381-90c2fc4e78e3@aklaver.com> Comments: In-reply-to Adrian Klaver message dated "Thu, 19 Dec 2024 09:21:52 -0800" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: <436269.1734642578.1@sss.pgh.pa.us> Date: Thu, 19 Dec 2024 16:09:38 -0500 Message-ID: <436270.1734642578@sss.pgh.pa.us> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Adrian Klaver writes: > The latest bug fixes of each are 16.6 and 17.2. I would upgrade to those > and then try again. Highly unlikely to make any difference. What's evidently going on here is that the test script attempts to do DROP SCHEMA concurrently with another session that's creating an object inside that schema. (Here, that's a function, but the particular type of object doesn't really matter.) There are three possible outcomes of that: 1. The object creation commits soon enough that DROP SCHEMA sees it, and drops the object along with the schema. 2. The object creation begins after DROP SCHEMA commits, and fails because the schema is not to be found. 3. The object creation goes through, leaving a now-dangling schema OID reference in the object's catalog entry. The object is useless because it's unnamable, but it won't really cause any trouble except for applications that scan the system catalogs (like pg_dump). Exactly none of these outcomes result in a usable object, so one wonders why your application is doing this sort of thing often enough to hit the race condition. We could prevent case 3 by locking the schema during object creation, converting it to one of the other cases. We actually do that for tables, but not for any other object types, reasoning that the greatly increased cost of locking would outweigh the problems that dangling objects create. (Note that to eliminate the issue fully, we'd have to lock every referenced object not only schemas; for example, also the data types of the function's arguments and result.) Also, adding such locking might well lead to deadlocks in concurrent add/drop scenarios, not just performance costs. tl;dr: it's been like this a long time, and I don't really foresee us accepting the costs of making it not act like that. I seem to recall someone submitting a patch recently that would add such locking, but I doubt it'll get accepted. regards, tom lane