Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1siFIC-0015DD-7D for pgsql-general@arkaria.postgresql.org; Sun, 25 Aug 2024 15:36:36 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1siFIA-00GKBr-DC for pgsql-general@arkaria.postgresql.org; Sun, 25 Aug 2024 15:36:34 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1siFI9-00GKBj-UR for pgsql-general@lists.postgresql.org; Sun, 25 Aug 2024 15:36:34 +0000 Received: from mail-ej1-x62d.google.com ([2a00:1450:4864:20::62d]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1siFI7-001NXz-U0 for pgsql-general@lists.postgresql.org; Sun, 25 Aug 2024 15:36:33 +0000 Received: by mail-ej1-x62d.google.com with SMTP id a640c23a62f3a-a867a564911so433405866b.2 for ; Sun, 25 Aug 2024 08:36:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1724600190; x=1725204990; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=LFhStU/1K7MFb+PO3QpLnL40Ve57Yplqf554edX6alE=; b=flwKs+T3djB4f4M/bWLkRzGv6LYUlf1OSn5BTA47R/FmxpcphoZ10Jltozj7701ruo Ol5WcYSpyYhTiRQD3mRMhH5Orhri34T7rKeo7tWxfR0vwDNHaWllhUQ0L5rcilWvfoLN 8LWcdnx2LM7BsXRADXGToKKbQncbBHQtYp8//dg4rfIsWrKwoSPN0ST//RFGImNEc+aV LFrx4VxG7meDk2KU6CY7fWBb0ZFLdT6LcGGi2m8ksVaB1yvuHNYtngPpvrbjUSayxvJk 3rsxnrDPCClvtWJce75T8dQGCOxqqlHgx5OK3Dwis+OF0/98g6kR1MpxVYvnLna+J9aZ 0ATw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1724600190; x=1725204990; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=LFhStU/1K7MFb+PO3QpLnL40Ve57Yplqf554edX6alE=; b=ObWKqxgLCPz4Sw60hdVxV4osZBy1/cgxXQia7je/RPMTRJlxQGRjwQwi4bnqiuWFoI 4D9qTTpgGTCVUqomqHRIFjUcnxehmwHTAsHrkl5WPt1b3/OdMmYlF6texfOdYXPJdyPF 1oVl8XRHu/cbvEOuFLsI0VlMguL9LnPVbmCOf3jH8nvmal9cXfhWqC3OOmVgykNHrf5q Pfyqv7uAIrDwCz8LSraCnrM4mP0PfnQNndBgAlNi7bs5zt8U0RgpE10jXApO4ftHRp2Z TErMH6cwpcOr5tNsScCNfhb8ffrN5LvxwfIlEi6XhCjHkmk9hX7DlOONg4HfoBVt1uc2 yeAg== X-Gm-Message-State: AOJu0YwPXsiEZrGaFCxgZH7CDkHGXlfVwIMMRrf2I8FHTDWjhUk1HTyO ln7o/s6uGQAwMhRT9OsNgHFiUW+z3mRS1QxU2hk0vB+Pjr82XpkzLAGmWWzLwmvQ/WAwjoiT0M+ WqGtYvD3bIFzAWz2QLGVx1LoEfLw= X-Google-Smtp-Source: AGHT+IGyHhaWfxbzTh8FJXGstKWZuZJXBSGFlm2S1cp0wqIAwicwZZqAcVzkcSjkhxRFxRuUCj464CHM1pV4rjgvo54= X-Received: by 2002:a17:907:3f18:b0:a7d:edb2:21f5 with SMTP id a640c23a62f3a-a86a5165e7dmr539475066b.12.1724600189491; Sun, 25 Aug 2024 08:36:29 -0700 (PDT) MIME-Version: 1.0 References: <73bc7752-442c-4c67-a263-46017f1948cc@vondra.me> In-Reply-To: <73bc7752-442c-4c67-a263-46017f1948cc@vondra.me> From: Marcelo Zabani Date: Sun, 25 Aug 2024 12:36:18 -0300 Message-ID: Subject: Re: ERROR: could not open relation with OID XXXX To: Tomas Vondra Cc: pgsql-general@lists.postgresql.org Content-Type: multipart/alternative; boundary="000000000000aca237062083c51c" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000aca237062083c51c Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable > we do some special stuff for catalogs That is good to know, thanks! > I believe you could actually lock the pg_class rows for update. Just add FOR UPDATE at the end of the query. Thanks, but I tried that and got "ERROR: permission denied for table pg_class", even if I try it only for tables the user owns. At least considering the use-case of avoiding this error due to temporary tables/indexes (which are a part of normal application execution), I was thinking of using materialized CTEs that filters those out, and only after that using other functions that for example take OIDs and return definitions. Other kinds of DDL that create non-temporary tables can be "blamed" on developers in my case. Do you think using those materialized CTEs could help? And do you think this can be considered a bug that I should report or is it just too edge-casey to consider? Regards. On Sun, Aug 25, 2024 at 12:06=E2=80=AFPM Tomas Vondra wro= te: > On 8/25/24 15:42, Marcelo Zabani wrote: > > Hi all, > > > > I can reproduce the error in the subject from time to time when queryin= g > > catalog tables while DDL is happening concurrently. Here's a bash scrip= t > > that reproduces it (not always, you might have to run it many times > > until you see ERROR: could not open relation with OID XXXX): > > > > #!/usr/bin/env bash > > psql -c "create table test(x serial primary key); select oid, relname > > from pg_class where relname=3D'test'" > > # The next two queries will run concurrently > > psql -c "select oid, relname, pg_sleep(3), pg_get_indexdef(oid) from > > pg_class join pg_index on indexrelid=3Dpg_class.oid WHERE > > relname=3D'test_pkey';" 2>&1 1>/tmp/pgbug.log & > > sleep 1 > > psql -c "drop table test" > > cat /tmp/pgbug.log > > wait > > > > I am confused as to how this is possible. I assume if the row with the > > test_pkey index exists in the pg_index catalog table, that the snapshot > > of the catalog tables contains the test table itself and is generally > > consistent, so querying the catalog should not run into such errors. > > > > I think you're assuming the whole query runs with a single snapshot, and > AFAIK that's not quite accurate - we do some special stuff for catalogs, > for example. There's also the additional complexity of maintaining a > cache on catalogs, invalidating it, etc. > > I don't have a great simple "this happens because X" explanation, but a > lot of this relies on proper locking - in particular, that we lock all > the objects before execution, which also invalidates all the caches etc. > > But that can't happen here, because we only realize we need to access > the OID very late in the execution, when we get to pg_get_indexdef. > > > I've seen this happen in Production without pg_sleep in the mix, too. I > > added pg_sleep to the example above only because it makes the error > > easier to reproduce. > > > > It's a race condition, essentially. The sleep just makes it easier to > hit, but it can happen without it. > > > Is there something I can do to avoid this? Is my understanding of how > > the catalog tables work wrong? > > > > I believe you could actually lock the pg_class rows for update. Just add > FOR UPDATE at the end of the query. > > > regards > > -- > Tomas Vondra > --000000000000aca237062083c51c Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
> we do some special stuff for catalogs
=
That is good to know, thanks!

> = I believe you could actually lock the pg_class rows for update. Just add FO= R UPDATE at the end of the query.

Thanks, but I tr= ied that and got "ERROR: =C2=A0permission denied for table pg_class&qu= ot;, even if I try it only for tables the user owns.

At least considering the use-case of avoiding this error due to temp= orary tables/indexes (which are a part of normal application execution), I = was thinking of using materialized CTEs that filters those out, and only af= ter that using other functions that for example take OIDs and return defini= tions. Other kinds of DDL that create non-temporary tables can be "bla= med" on developers in my case.
Do you think using those mate= rialized CTEs could help? And do you think this can be considered a bug tha= t I should report or is it just too edge-casey to consider?

<= /div>
Regards.

On Sun, Aug 25, 2024 at 12:06=E2=80=AFPM Toma= s Vondra <tomas@vondra.me> wro= te:
On 8/25/24 1= 5:42, Marcelo Zabani wrote:
> Hi all,
>
> I can reproduce the error in the subject from time to time when queryi= ng
> catalog tables while DDL is happening concurrently. Here's a bash = script
> that reproduces it (not always, you might have to run it many times > until you see ERROR: =C2=A0could not open relation with OID XXXX):
>
> #!/usr/bin/env bash
> psql -c "create table test(x serial primary key); select oid, rel= name
> from pg_class where relname=3D'test'"
> # The next two queries will run concurrently
> psql -c "select oid, relname, pg_sleep(3), pg_get_indexdef(oid) f= rom
> pg_class join pg_index on indexrelid=3Dpg_class.oid WHERE
> relname=3D'test_pkey';" 2>&1 1>/tmp/pgbug.log &= amp;
> sleep 1
> psql -c "drop table test"
> cat /tmp/pgbug.log
> wait
>
> I am confused as to how this is possible. I assume if the row with the=
> test_pkey index exists in the pg_index catalog table, that the snapsho= t
> of the catalog tables contains the test table itself and is generally<= br> > consistent, so querying the catalog should not run into such errors. >

I think you're assuming the whole query runs with a single snapshot, an= d
AFAIK that's not quite accurate - we do some special stuff for catalogs= ,
for example. There's also the additional complexity of maintaining a cache on catalogs, invalidating it, etc.

I don't have a great simple "this happens because X" explanat= ion, but a
lot of this relies on proper locking - in particular, that we lock all
the objects before execution, which also invalidates all the caches etc.
But that can't happen here, because we only realize we need to access the OID very late in the execution, when we get to pg_get_indexdef.

> I've seen this happen in Production without pg_sleep in the mix, t= oo. I
> added pg_sleep to the example above only because it makes the error > easier to reproduce.
>

It's a race condition, essentially. The sleep just makes it easier to hit, but it can happen without it.

> Is there something I can do to avoid this? Is my understanding of how<= br> > the catalog tables work wrong?
>

I believe you could actually lock the pg_class rows for update. Just add FOR UPDATE at the end of the query.


regards

--
Tomas Vondra
--000000000000aca237062083c51c--