Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sc3b8-003WxG-OZ for pgsql-hackers@arkaria.postgresql.org; Thu, 08 Aug 2024 13:54:34 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sc3b6-00EXzG-Mh for pgsql-hackers@arkaria.postgresql.org; Thu, 08 Aug 2024 13:54:32 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sc3b6-00EXz8-Bh for pgsql-hackers@lists.postgresql.org; Thu, 08 Aug 2024 13:54:32 +0000 Received: from mail-ed1-x531.google.com ([2a00:1450:4864:20::531]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sc3b3-003o8f-J2 for pgsql-hackers@postgresql.org; Thu, 08 Aug 2024 13:54:31 +0000 Received: by mail-ed1-x531.google.com with SMTP id 4fb4d7f45d1cf-5bb8e62570fso1186332a12.1 for ; Thu, 08 Aug 2024 06:54:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1723125269; x=1723730069; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=mkJLhH1nzQA0JGf7n67xouPxJX1xmhCkJnGnsSo0akw=; b=Gv9lBpIkhpC7ZdbEW8U8DWnW3eflEiJFTIZev9+uPSLoTHV+7cXf9FZFsi14UZJ+wn 2E6XeUXKyu3oUsW2dZipVqnx0TMhRP79BU4PFgLvqzy/Vsn2m4stHc819xu69qQMRPRK l3pAwwF10CxRz9gVjiWOUfwNRkDMGOj//hCAmSGNfkWuAXsNWaaid77mXnvbeVfmrdyq 2jGHPX7sgRHZ581ipPxGz7UIioN9NNxStczdt51xV0MEZlZySTYGlqP7YN1QzgQ70a+D 5ky2Bcyl7eT+gr0MDMQBn5Dd9rWEwDFrgaHRUg9nFQ/+XcvQ/DwA8XnUqtD3G396kBNH 09Sg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1723125269; x=1723730069; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=mkJLhH1nzQA0JGf7n67xouPxJX1xmhCkJnGnsSo0akw=; b=pf4RRE2cwyIoTOGP+wqR6ptDUPe55huzDhsi7k1LIeZ+0ZlLqujBpRb76EtJJg/3Pq PmXZvpZrhNUxbM9QRf9YWw7xleI8AewyVxxfFXo1tc4Ttqr6DHkwSub4Oz9q+tuqcdsW HyEpYndkc/TsbP/U+K+c1uyGY0Z07juySAJ9brpXhOVaw0TTr3kbOwmG4sbPkZTYJ/wQ e3fAEBeYZVzmxszyLKbQZDIe5tFZ2eDsC0487WH3gBz6rS+TVFyGfiZDfLylnUvJsRF+ zzcjbNtzIl9lgOZttSTEVY6SMV5G8JFooqVkhK/khUtbQo5I7x75ArE5kEEfVbVikJFz 62iQ== X-Gm-Message-State: AOJu0YwsxaVUyaql3+ZTgd4/WW1tC81BhX/o1Ec04muvgZuxfPy5W6PK pg6wzwLT286NR4rL9CkEogS5TNIu752kwV7/H6Ta5aA4vA3hWsOrnaw6g+yH0krZq7mMUF6a5ui mltSq1JLpbbDjZzO+jFrW3ZyFq00= X-Google-Smtp-Source: AGHT+IFCtbAWb/C13KlOZm26hoxflliULmH6vTTn+HZPIjUFSJ0c7XvqHmZFQNeUZWWceKnfQG3LI3zh5udnyE3KjE4= X-Received: by 2002:a17:907:f147:b0:a77:db97:f4fd with SMTP id a640c23a62f3a-a8090d93da3mr139787966b.34.1723125268411; Thu, 08 Aug 2024 06:54:28 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Michail Nikolaev Date: Thu, 8 Aug 2024 15:53:00 +0200 Message-ID: Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements To: Matthias van de Meent Cc: PostgreSQL Hackers , Andrey Borodin , Melanie Plageman Content-Type: multipart/alternative; boundary="000000000000870355061f2c5df7" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000870355061f2c5df7 Content-Type: text/plain; charset="UTF-8" Hello, Matthias! > While waiting for this, here are some initial comments on the github diffs: Thanks for your review! While stress testing the POC, I found some issues unrelated to the patch that need to be fixed first. This is [1] and [2]. >> Additional index is lightweight and does not produce any WAL. > That doesn't seem to be what I see in the current patchset: Persistence is passed as parameter [3] and set to RELPERSISTENCE_UNLOGGED for auxiliary indexes [4]. > - I notice you've added a new argument to > heapam_index_build_range_scan. I think this could just as well be > implemented by reading the indexInfo->ii_Concurrent field, as the > values should be equivalent, right? Not always; currently, it is set by ResetSnapshotsAllowed[5]. We fall back to regular index build if there is a predicate or expression in the index (which should be considered "safe" according to [6]). However, we may remove this check later. Additionally, there is no sense in resetting the snapshot if we already have an xmin assigned to the backend for some reason. > In heapam_index_build_range_scan, it seems like you're popping the > snapshot and registering a new one while holding a tuple from > heap_getnext(), thus while holding a page lock. I'm not so sure that's > OK, expecially when catalogs are also involved (specifically for > expression indexes, where functions could potentially be updated or > dropped if we re-create the visibility snapshot) Yeah, good catch. Initially, I implemented a different approach by extracting the catalog xmin to a separate horizon [7]. It might be better to return to this option. > In heapam_index_build_range_scan, you pop the snapshot before the > returned heaptuple is processed and passed to the index-provided > callback. I think that's incorrect, as it'll change the visibility of > the returned tuple before it's passed to the index's callback. I think > the snapshot manipulation is best added at the end of the loop, if we > add it at all in that function. Yes, this needs to be fixed as well. > The snapshot reset interval is quite high, at 500ms. Why did you > configure it that low, and didn't you make this configurable? It is just a random value for testing purposes. I don't think there is a need to make it configurable. Getting a new snapshot is a cheap operation now, so we can do it more often if required. Internally, I was testing it with a 0ms interval. > You seem to be using WAL in the STIR index, while it doesn't seem > that relevant for the use case of auxiliary indexes that won't return > any data and are only used on the primary. It would imply that the > data is being sent to replicas and more data being written than > strictly necessary, which to me seems wasteful. It just looks like an index with WAL, but as mentioned above, it is unlogged in actual usage. > The locking in stirinsert can probably be improved significantly if > we use things like atomic operations on STIR pages. We'd need an > exclusive lock only for page initialization, while share locks are > enough if the page's data is modified without WAL. That should improve > concurrent insert performance significantly, as it would further > reduce the length of the exclusively locked hot path. Hm, good idea. I'll check it later. Best regards & thanks again, Mikhail [1]: https://www.postgresql.org/message-id/CANtu0ohHmYXsK5bxU9Thcq1FbELLAk0S2Zap0r8AnU3OTmcCOA%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CANtu0ojga8s9%2BJ89cAgLzn2e-bQgy3L0iQCKaCnTL%3Dppot%3Dqhw%40mail.gmail.com [3]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach#diff-50abc48efcc362f0d3194aceba6969429f46fa1f07a119e555255545e6655933R93 [4]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/catalog/index.c#L1600 [5]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/catalog/index.c#L2657 [6]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/commands/indexcmds.c#L1129 [7]: https://github.com/postgres/postgres/commit/38b243d6cc7358a44cb1a865b919bf9633825b0c --000000000000870355061f2c5df7 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello, Matthias!

> While waiting for this, here = are some initial comments on the github diffs:

Thanks for your revie= w!
While stress testing the POC, I found some issues unrelated to the pa= tch that need to be fixed first.
This is [1] and [2].

>> Ad= ditional index is lightweight and does not produce any WAL.
> That do= esn't seem to be what I see in the current patchset:

Persistenc= e is passed as parameter [3] and set to RELPERSISTENCE_UNLOGGED for auxilia= ry indexes [4].

> - I notice you've added a new argument to> heapam_index_build_range_scan. I think this could just as well be> implemented by reading the indexInfo->ii_Concurrent field, as the<= br>> values should be equivalent, right?

Not always; currently, i= t is set by ResetSnapshotsAllowed[5].
We fall back to regular index buil= d if there is a predicate or expression in the index (which should be consi= dered "safe" according to [6]).
However, we may remove this ch= eck later.
Additionally, there is no sense in resetting the snapshot if = we already have an xmin assigned to the backend for some reason.

>= ; In heapam_index_build_range_scan, it seems like you're popping the> snapshot and registering a new one while holding a tuple from
>= heap_getnext(), thus while holding a page lock. I'm not so sure that&#= 39;s
> OK, expecially when catalogs are also involved (specifically f= or
> expression indexes, where functions could potentially be updated= or
> dropped if we re-create the visibility snapshot)

Yeah, g= ood catch.
Initially, I implemented a different approach by extracting t= he catalog xmin to a separate horizon [7]. It might be better to return to = this option.

> In heapam_index_build_range_scan, you pop the snap= shot before the
> returned heaptuple is processed and passed to the i= ndex-provided
> callback. I think that's incorrect, as it'll = change the visibility of
> the returned tuple before it's passed = to the index's callback. I think
> the snapshot manipulation is b= est added at the end of the loop, if we
> add it at all in that funct= ion.

Yes, this needs to be fixed as well.

> The snapshot r= eset interval is quite high, at 500ms. Why did you
> configure it tha= t low, and didn't you make this configurable?

It is just a rando= m value for testing purposes.
I don't think there is a need to make = it configurable.
Getting a new snapshot is a cheap operation now, so we = can do it more often if required.
Internally, I was testing it with a 0m= s interval.

> You seem to be using WAL in the STIR index, while i= t doesn't seem
> that relevant for the use case of auxiliary inde= xes that won't return
> any data and are only used on the primary= . It would imply that the
> data is being sent to replicas and more d= ata being written than
> strictly necessary, which to me seems wastef= ul.

It just looks like an index with WAL, but as mentioned above, it= is unlogged in actual usage.

> The locking in stirinsert can pro= bably be improved significantly if
> we use things like atomic operat= ions on STIR pages. We'd need an
> exclusive lock only for page i= nitialization, while share locks are
> enough if the page's data = is modified without WAL. That should improve
> concurrent insert perf= ormance significantly, as it would further
> reduce the length of the= exclusively locked hot path.

Hm, good idea. I'll check it later= .

Best regards & thanks again,
Mikhail

[1]: https://www.postgresql.= org/message-id/CANtu0ohHmYXsK5bxU9Thcq1FbELLAk0S2Zap0r8AnU3OTmcCOA%40mail.g= mail.com
[2]: https://www.postgresql.org/message-id/CANtu0ojga8s9%2BJ89cAgL= zn2e-bQgy3L0iQCKaCnTL%3Dppot%3Dqhw%40mail.gmail.com
[3]: https://github.com/po= stgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concur= rently_approach#diff-50abc48efcc362f0d3194aceba6969429f46fa1f07a119e5552555= 45e6655933R93
[4]: https://github.com/michail-nikolaev/postgres= /blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/catalog/index.c#= L1600
[5]: https://github.com/michail-nikolaev/postgres/blob/e2= 698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/catalog/index.c#L2657
[6]:
https://github.com/michail-nikolaev/postgres/blob/e2698= ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/commands/indexcmds.c#L1129<= /a>
[7]:
https://github.com/post= gres/postgres/commit/38b243d6cc7358a44cb1a865b919bf9633825b0c
--000000000000870355061f2c5df7--