public inbox for [email protected]  
help / color / mirror / Atom feed
From: Tomas Vondra <[email protected]>
To: Marcelo Zabani <[email protected]>
To: [email protected]
Subject: Re: ERROR: could not open relation with OID XXXX
Date: Sun, 25 Aug 2024 17:06:46 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <CACgY3QYe3mqMkODOzwx=EN4iGoddObUotLL8hGnCg==3vmxd_g@mail.gmail.com>
References: <CACgY3QYe3mqMkODOzwx=EN4iGoddObUotLL8hGnCg==3vmxd_g@mail.gmail.com>

On 8/25/24 15:42, Marcelo Zabani wrote:
> Hi all,
> 
> I can reproduce the error in the subject from time to time when querying
> catalog tables while DDL is happening concurrently. Here's a bash script
> that reproduces it (not always, you might have to run it many times
> until you see ERROR:  could not open relation with OID XXXX):
> 
> #!/usr/bin/env bash
> psql -c "create table test(x serial primary key); select oid, relname
> from pg_class where relname='test'"
> # The next two queries will run concurrently
> psql -c "select oid, relname, pg_sleep(3), pg_get_indexdef(oid) from
> pg_class join pg_index on indexrelid=pg_class.oid WHERE
> relname='test_pkey';" 2>&1 1>/tmp/pgbug.log &
> sleep 1
> psql -c "drop table test"
> cat /tmp/pgbug.log
> wait
> 
> I am confused as to how this is possible. I assume if the row with the
> test_pkey index exists in the pg_index catalog table, that the snapshot
> of the catalog tables contains the test table itself and is generally
> consistent, so querying the catalog should not run into such errors.
> 

I think you're assuming the whole query runs with a single snapshot, and
AFAIK that's not quite accurate - we do some special stuff for catalogs,
for example. There's also the additional complexity of maintaining a
cache on catalogs, invalidating it, etc.

I don't have a great simple "this happens because X" explanation, but a
lot of this relies on proper locking - in particular, that we lock all
the objects before execution, which also invalidates all the caches etc.

But that can't happen here, because we only realize we need to access
the OID very late in the execution, when we get to pg_get_indexdef.

> I've seen this happen in Production without pg_sleep in the mix, too. I
> added pg_sleep to the example above only because it makes the error
> easier to reproduce.
> 

It's a race condition, essentially. The sleep just makes it easier to
hit, but it can happen without it.

> Is there something I can do to avoid this? Is my understanding of how
> the catalog tables work wrong?
> 

I believe you could actually lock the pg_class rows for update. Just add
FOR UPDATE at the end of the query.


regards

-- 
Tomas Vondra






view thread (8+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: ERROR: could not open relation with OID XXXX
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox