From: Alexander Korotkov
Date: Thu, 2 Apr 2026 14:02:40 +0300
Subject: Re: POC: Parallel processing of indexes in autovacuum
To: Masahiko Sawada
Cc: SATYANARAYANA NARLAPURAM, Daniil Davydov <3danissimo@gmail.com>, Bharath Rupireddy, Sami Imseih, Matheus Alcantara, Maxim Orlov, Postgres hackers

Hi!

On Wed, Apr 1, 2026 at 9:55 PM Masahiko Sawada wrote:
>
> On Mon, Mar 30, 2026 at 5:14 PM SATYANARAYANA NARLAPURAM wrote:
> >
> > Hi
> >
> > On Mon, Mar 30, 2026 at 1:44 AM Daniil Davydov <3danissimo@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On Mon, Mar 30, 2026 at 7:17 AM SATYANARAYANA NARLAPURAM wrote:
> >> >
> >> > Thank you for working on this, very useful feature. Sharing a few thoughts:
> >> >
> >> > 1. Shouldn't we also cap by max_parallel_workers to avoid wasting DSM resources in parallel_vacuum_compute_workers?
> >>
> >> Actually, autovacuum_max_parallel_workers is already limited by
> >> max_parallel_workers.
> >> It is not clear for me why we allow setting this GUC
> >> higher than max_parallel_workers, but if this happens, I think it is a user's
> >> misconfiguration.
> >>
> >> > 2. Is it intentional that other autovacuum workers not yield cost limits to the parallel auto vacuum workers? Cost limits are distributed first equally to the autovacuum workers.
> >> > and then they share that. Therefore, parallel workers will be heavily throttled. IIUC, this problem doesn't exist with manual vacuum.
> >> > If we don't fix this, at least we should document this.
> >>
> >> Parallel a/v workers inherit cost based parameters (including the
> >> vacuum_cost_limit) from the leader worker. Do you mean that this can be too
> >> low value for parallel operation? If so, user can manually increase the
> >> vacuum_cost_limit reloption for those tables, where parallel a/v sleeps too
> >> much (due to cost delay).
> >>
> >> BTW, describing the cost limit propagation to the parallel a/v workers is
> >> worth mentioning in the documentation. I'll add it in the next patch version.
> >>
> >> > 3. Additionally, is there a point where, based on the cost limits, launching additional workers becomes counterproductive compared to running fewer workers and preventing it?
> >>
> >> I don't think that we can possibly find a universal limit that will be
> >> appropriate for all possible configurations. By now we are using a pretty
> >> simple formula for parallel degree calculation. Since user have several ways
> >> to affect this formula, I guess that there will be no problems with it (except
> >> my concerns about opt-out style).
> >>
> >> > 4. Would it make sense to add a table level override to disable parallelism or set parallel worker count?
> >>
> >> We already have the "autovacuum_parallel_workers" reloption that is used as
> >> an additional limit for the number of parallel workers. In particular, this
> >> reloption can be used to disable parallelism at all.
> >>
> >> >
> >> > I ran some perf tests to show the improvements with parallel vacuum and shared below.
> >>
> >> Thank you very much!
> >>
> >> > Observations:
> >> >
> >> > 1. Parallel autovacuum provides consistent speedup. With cost_limit=200 and
> >> >    7 workers, vacuum completes 1.41x faster (71s -> 50s). With cost_limit=60,
> >> >    the speedup is 1.25x (194s -> 154s).
> >> > 2. I see the benefit comes from parallelizing index vacuum. With 8 indexes totaling
> >> >    ~530 MB, parallel workers scan indexes concurrently instead of the leader
> >> >    scanning them one by one. The leader's CPU user time drops from ~3s to
> >> >    ~0.8s as index work is offloaded
> >> >
> >>
> >> 1.41 speedup with 7 parallel workers may not seem like a great win, but it is
> >> a whole time of autovacuum operation (not only index bulkdel/cleanup) with
> >> pretty small indexes.
> >>
> >> May I ask you to run the same test with a higher table's size (several dozen
> >> gigabytes)? I think the results will be more "expressive".
> >
> >
> > I ran it with a Billion rows in a table with 8 indexes. The improvement with 7 workers is 1.8x.
> > Please note that there is a fixed overhead in other vacuum steps, for example heap scan.
> > In the environments where cost-based delay is used (the default), benefits will be modest
> > unless vacuum_cost_delay is set to sufficiently large value.
> >
> > Hardware:
> >   CPU:   Intel Xeon Platinum 8573C, 1 socket × 8 cores × 2 threads = 16 vCPUs
> >   RAM:   128 GB (131,900 MB)
> >   Swap:  None
> >
> > Workload Description
> >
> > Table Schema:
> >   CREATE TABLE avtest (
> >       id      bigint PRIMARY KEY,
> >       col1    int,        -- random()*1e9
> >       col2    int,        -- random()*1e9
> >       col3    int,        -- random()*1e9
> >       col4    int,        -- random()*1e9
> >       col5    int,        -- random()*1e9
> >       col6    text,       -- 'text_' || random()*1e6 (short text ~10 chars)
> >       col7    timestamp,  -- now() - random()*365 days
> >       padding text        -- repeat('x', 50)
> >   ) WITH (fillfactor = 90);
> >
> > Indexes (8 total):
> >   avtest_pkey - btree on (id)   bigint
> >   idx_av_col1 - btree on (col1) int
> >   idx_av_col2 - btree on (col2) int
> >   idx_av_col3 - btree on (col3) int
> >   idx_av_col4 - btree on (col4) int
> >   idx_av_col5 - btree on (col5) int
> >   idx_av_col6 - btree on (col6) text
> >   idx_av_col7 - btree on (col7) timestamp
> >
> > Dead Tuple Generation:
> >   DELETE FROM avtest WHERE id % 5 IN (1, 2);
> >   This deletes exactly 40% of rows, uniformly distributed across all pages.
> >
> > Vacuum Trigger:
> >   Autovacuum is triggered naturally by lowering the threshold to 0 and setting
> >   scale_factor to a value that causes immediate launch after the DELETE.
> >
> > Worker Configurations Tested:
> >   0 workers - leader-only vacuum (baseline, no parallelism)
> >   2 workers - leader + 2 parallel workers (3 processes total)
> >   4 workers - leader + 4 parallel workers (5 processes total)
> >   7 workers - leader + 7 parallel workers (8 processes total, 1 per index)
> >
> > Dataset:
> >   Rows:        1,000,000,000
> >   Heap size:   139 GB
> >   Total size:  279 GB (heap + 8 indexes)
> >   Dead tuples: 400,000,000 (40%)
> >
> > Index Sizes:
> >   avtest_pkey    21 GB  (bigint)
> >   idx_av_col7    21 GB  (timestamp)
> >   idx_av_col1    18 GB  (int)
> >   idx_av_col2    18 GB  (int)
> >   idx_av_col3    18 GB  (int)
> >   idx_av_col4    18 GB  (int)
> >   idx_av_col5    18 GB  (int)
> >   idx_av_col6     7 GB  (text - shorter keys, smaller index)
> >   Total indexes: 139 GB
> >
> > Server Settings:
> >   shared_buffers = 96GB
> >   maintenance_work_mem = 1GB
> >   max_wal_size = 100GB
> >   checkpoint_timeout = 1h
> >   autovacuum_vacuum_cost_delay = 0ms (NO throttling)
> >   autovacuum_vacuum_cost_limit = 1000
> >
> >
> > Summary:
> >
> >   Workers  Avg(s)   Min(s)   Max(s)   Speedup  Time Saved
> >   -------  -------  -------  -------  -------  ------------------
> >   0        1645.93  1645.01  1646.84  1.00x    -
> >   2        1276.35  1275.64  1277.05  1.29x    369.58s (6.2 min)
> >   4        1052.62  1048.92  1056.32  1.56x    593.31s (9.9 min)
> >   7         892.23   886.59   897.86  1.84x    753.70s (12.6 min)
> >
>
> Thank you for sharing the performance test results!
>
> While the benchmark results look good to me, have you compared the
> performance differences between parallel vacuum in the VACUUM command
> (with the PARALLEL option) and parallel vacuum in autovacuum? Since
> parallel autovacuum introduces some logic to check for delay parameter
> updates, I thought it was worth verifying if this adds any overhead.
>
> BTW, in my view, the most challenging part of this patch is the
> propagation logic for vacuum delay parameters.
> This propagation is necessary because, unlike manual VACUUM, autovacuum
> workers can reload their configuration during operation. We must ensure
> that parallel workers stay synchronized with these updated parameters.
>
> The current patch implements this in vacuumparallel.c: the leader
> shares delay parameters in DSM and updates them (if any vacuum delay
> parameters are updated) after a config reload, while workers poll for
> updates at every vacuum_delay_point() call to refresh their local
> variables.
>
> Another possible approach would be an event-driven model where the
> leader notifies workers after updating shared parameters - for example,
> by adding a shm_mq between the leader (as the sender) and each worker
> (as the receiver).
>
> I've compared these two ideas and opted for the former (polling).
> While a polling approach could theoretically be costly, the current
> implementation is self-contained within the parallel vacuum logic and
> does not touch the core parallel query infrastructure. The
> notification approach might look more elegant, but I'm concerned it
> adds unnecessary complexity just for the autovacuum case. Since the
> polling is essentially just checking an atomic variable, the overhead
> should be negligible.
>
> To verify this, I conducted benchmarks comparing the whole execution
> time and index vacuuming duration.
>
> Setup:
>
> - Disabled (auto) vacuum delays and buffer usage limits.
> - Parallel autovacuum with 1 worker on a table with 2 indexes (approx.
>   4 GB each).
> - 5 runs.
>
> Case 1: The latest patch (with polling)
>
> Average: 3.95s (Index: 1.54s)
> Median: 3.62s (Index: 1.37s)
>
> Case 2: The latest patch without polling
>
> Average: 3.98s (Index: 1.56s)
> Median: 3.70s (Index: 1.40s)
>
> Note that in order to simulate the code that doesn't have the polling,
> I reverted the following change:
>
> -   if (InterruptPending ||
> -       (!VacuumCostActive && !ConfigReloadPending))
> +   if (InterruptPending)
> +       return;
> +
> +   if (IsParallelWorker())
> +   {
> +       /*
> +        * Update cost-based vacuum delay parameters for a parallel autovacuum
> +        * worker if any changes are detected.
> +        */
> +       parallel_vacuum_update_shared_delay_params();
> +   }
> +
> +   if (!VacuumCostActive && !ConfigReloadPending)
>
> The parallel vacuum workers don't check the shared vacuum delay
> parameter at all, which is still fine as I disabled vacuum delays.
>
> Overall, the results show no noticeable overhead from the polling approach.

I would say this polling approach is very cheap. When there are no
updates, it only has to check a single 32-bit value in shared memory.
That value is rarely updated, so it caches well. No wonder we see no
measurable overhead.

Regarding the event-driven approach: the parallel worker process is
busy with other work (the actual vacuuming), so it would have to poll
for new events from time to time anyway. Thus, I don't think polling
for new events can be made any cheaper than the current approach of
polling for updates in shmem. If the worker process were just waiting
for GUC updates with nothing else to do, then, for instance, waiting on
the latch would be cheaper than polling in a loop, but that's not our
case.

I don't see the current polling approach for GUC updates as a
performance problem.

------
Regards,
Alexander Korotkov
Supabase
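
To illustrate why the no-update path is cheap, here is a minimal,
self-contained C11 sketch of the idea. It is not the patch code: the
struct names, field names, and functions below are all hypothetical,
and the use of a generation counter is my assumption about how a
single 32-bit check could signal changes. It only shows a leader
bumping a counter after changing shared values, and a worker doing one
atomic load and compare at each delay point.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the delay parameters the leader keeps in DSM. */
typedef struct SharedDelayParams
{
    _Atomic uint32_t generation;    /* bumped by the leader after each update */
    int         cost_limit;
    double      cost_delay_ms;
} SharedDelayParams;

/* Hypothetical per-worker local copy of the parameters. */
typedef struct WorkerDelayState
{
    uint32_t    seen_generation;    /* last generation this worker applied */
    int         cost_limit;
    double      cost_delay_ms;
} WorkerDelayState;

/* Leader: set the initial values before any worker attaches. */
void
shared_init(SharedDelayParams *shared, int cost_limit, double cost_delay_ms)
{
    assert(shared != NULL);
    shared->cost_limit = cost_limit;
    shared->cost_delay_ms = cost_delay_ms;
    atomic_init(&shared->generation, 0);
}

/* Leader: store new values, then publish them by bumping the generation. */
void
leader_publish(SharedDelayParams *shared, int cost_limit, double cost_delay_ms)
{
    shared->cost_limit = cost_limit;
    shared->cost_delay_ms = cost_delay_ms;
    atomic_fetch_add_explicit(&shared->generation, 1, memory_order_release);
}

/* Worker: take the initial copy when attaching. */
void
worker_attach(SharedDelayParams *shared, WorkerDelayState *local)
{
    local->seen_generation =
        atomic_load_explicit(&shared->generation, memory_order_acquire);
    local->cost_limit = shared->cost_limit;
    local->cost_delay_ms = shared->cost_delay_ms;
}

/*
 * Worker: called at every delay point.  The common case is one atomic
 * 32-bit load plus a compare.  Returns true only if values were refreshed.
 */
bool
worker_poll(SharedDelayParams *shared, WorkerDelayState *local)
{
    uint32_t    gen =
        atomic_load_explicit(&shared->generation, memory_order_acquire);

    if (gen == local->seen_generation)
        return false;           /* fast path: nothing changed */

    /*
     * Slow path.  This sketch is driven from a single thread; truly
     * concurrent code would need a lock (or similar) around the copy so
     * that a second update cannot interleave with it.
     */
    local->cost_limit = shared->cost_limit;
    local->cost_delay_ms = shared->cost_delay_ms;
    local->seen_generation = gen;
    return true;
}
```

In this sketch worker_poll() returns false until leader_publish() is
called, returns true exactly once after each publish, and then goes
back to the fast path.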