From: Masahiko Sawada
Date: Wed, 1 Apr 2026 11:54:26 -0700
Subject: Re: POC: Parallel processing of indexes in autovacuum
To: SATYANARAYANA NARLAPURAM
Cc: Daniil Davydov <3danissimo@gmail.com>, Bharath Rupireddy, Sami Imseih, Alexander Korotkov, Matheus Alcantara, Maxim Orlov, Postgres hackers

On Mon, Mar 30, 2026 at 5:14 PM SATYANARAYANA NARLAPURAM wrote:
>
> Hi
>
> On Mon, Mar 30, 2026 at 1:44 AM Daniil Davydov <3danissimo@gmail.com> wrote:
>>
>> Hi,
>>
>> On Mon, Mar 30, 2026 at 7:17 AM SATYANARAYANA NARLAPURAM
>> wrote:
>> >
>> > Thank you for working on this, very useful feature. Sharing a few thoughts:
>> >
>> > 1. Shouldn't we also cap by max_parallel_workers to avoid wasting DSM resources in parallel_vacuum_compute_workers?
>>
>> Actually, autovacuum_max_parallel_workers is already limited by
>> max_parallel_workers. It is not clear for me why we allow setting this GUC
>> higher than max_parallel_workers, but if this happens, I think it is a user's
>> misconfiguration.
>>
>> > 2. Is it intentional that other autovacuum workers not yield cost limits to the parallel auto vacuum workers? Cost limits are distributed first equally to the autovacuum workers.
>> > and then they share that. Therefore, parallel workers will be heavily throttled. IIUC, this problem doesn't exist with manual vacuum.
>> > If we don't fix this, at least we should document this.
>>
>> Parallel a/v workers inherit cost based parameters (including the
>> vacuum_cost_limit) from the leader worker. Do you mean that this can be too
>> low value for parallel operation? If so, user can manually increase the
>> vacuum_cost_limit reloption for those tables, where parallel a/v sleeps too
>> much (due to cost delay).
>>
>> BTW, describing the cost limit propagation to the parallel a/v workers is
>> worth mentioning in the documentation. I'll add it in the next patch version.
>>
>> > 3. Additionally, is there a point where, based on the cost limits, launching additional workers becomes counterproductive compared to running fewer workers and preventing it?
>>
>> I don't think that we can possibly find a universal limit that will be
>> appropriate for all possible configurations. By now we are using a pretty
>> simple formula for parallel degree calculation. Since user have several ways
>> to affect this formula, I guess that there will be no problems with it (except
>> my concerns about opt-out style).
>>
>> > 4. Would it make sense to add a table level override to disable parallelism or set parallel worker count?
>>
>> We already have the "autovacuum_parallel_workers" reloption that is used as
>> an additional limit for the number of parallel workers. In particular, this
>> reloption can be used to disable parallelism at all.
>>
>> >
>> > I ran some perf tests to show the improvements with parallel vacuum and shared below.
>>
>> Thank you very much!
>>
>> > Observations:
>> >
>> > 1. Parallel autovacuum provides consistent speedup. With cost_limit=200 and
>> >    7 workers, vacuum completes 1.41x faster (71s -> 50s). With cost_limit=60,
>> >    the speedup is 1.25x (194s -> 154s).
>> > 2. I see the benefit comes from parallelizing index vacuum. With 8 indexes totaling
>> >    ~530 MB, parallel workers scan indexes concurrently instead of the leader
>> >    scanning them one by one. The leader's CPU user time drops from ~3s to
>> >    ~0.8s as index work is offloaded
>> >
>>
>> 1.41 speedup with 7 parallel workers may not seem like a great win, but it is
>> a whole time of autovacuum operation (not only index bulkdel/cleanup) with
>> pretty small indexes.
>>
>> May I ask you to run the same test with a higher table's size (several dozen
>> gigabytes)? I think the results will be more "expressive".
>
>
> I ran it with a Billion rows in a table with 8 indexes. The improvement with 7 workers is 1.8x.
> Please note that there is a fixed overhead in other vacuum steps, for example heap scan.
> In the environments where cost-based delay is used (the default), benefits will be modest
> unless vacuum_cost_delay is set to sufficiently large value.
>
> Hardware:
>   CPU:  Intel Xeon Platinum 8573C, 1 socket × 8 cores × 2 threads = 16 vCPUs
>   RAM:  128 GB (131,900 MB)
>   Swap: None
>
> Workload Description
>
> Table Schema:
>   CREATE TABLE avtest (
>       id      bigint PRIMARY KEY,
>       col1    int,        -- random()*1e9
>       col2    int,        -- random()*1e9
>       col3    int,        -- random()*1e9
>       col4    int,        -- random()*1e9
>       col5    int,        -- random()*1e9
>       col6    text,       -- 'text_' || random()*1e6 (short text ~10 chars)
>       col7    timestamp,  -- now() - random()*365 days
>       padding text        -- repeat('x', 50)
>   ) WITH (fillfactor = 90);
>
> Indexes (8 total):
>   avtest_pkey - btree on (id)   bigint
>   idx_av_col1 - btree on (col1) int
>   idx_av_col2 - btree on (col2) int
>   idx_av_col3 - btree on (col3) int
>   idx_av_col4 - btree on (col4) int
>   idx_av_col5 - btree on (col5) int
>   idx_av_col6 - btree on (col6) text
>   idx_av_col7 - btree on (col7) timestamp
>
> Dead Tuple Generation:
>   DELETE FROM avtest WHERE id % 5 IN (1, 2);
>   This deletes exactly 40% of rows, uniformly distributed across all pages.
>
> Vacuum Trigger:
>   Autovacuum is triggered naturally by lowering the threshold to 0 and setting
>   scale_factor to a value that causes immediate launch after the DELETE.
>
> Worker Configurations Tested:
>   0 workers - leader-only vacuum (baseline, no parallelism)
>   2 workers - leader + 2 parallel workers (3 processes total)
>   4 workers - leader + 4 parallel workers (5 processes total)
>   7 workers - leader + 7 parallel workers (8 processes total, 1 per index)
>
> Dataset:
>   Rows:        1,000,000,000
>   Heap size:   139 GB
>   Total size:  279 GB (heap + 8 indexes)
>   Dead tuples: 400,000,000 (40%)
>
> Index Sizes:
>   avtest_pkey    21 GB (bigint)
>   idx_av_col7    21 GB (timestamp)
>   idx_av_col1    18 GB (int)
>   idx_av_col2    18 GB (int)
>   idx_av_col3    18 GB (int)
>   idx_av_col4    18 GB (int)
>   idx_av_col5    18 GB (int)
>   idx_av_col6     7 GB (text - shorter keys, smaller index)
>   Total indexes: 139 GB
>
> Server Settings:
>   shared_buffers = 96GB
>   maintenance_work_mem = 1GB
>   max_wal_size = 100GB
>   checkpoint_timeout = 1h
>   autovacuum_vacuum_cost_delay = 0ms (NO throttling)
>   autovacuum_vacuum_cost_limit = 1000
>
>
> Summary:
>
> Workers  Avg(s)   Min(s)   Max(s)   Speedup  Time Saved
> -------  -------  -------  -------  -------  ------------------
> 0        1645.93  1645.01  1646.84  1.00x    -
> 2        1276.35  1275.64  1277.05  1.29x    369.58s (6.2 min)
> 4        1052.62  1048.92  1056.32  1.56x    593.31s (9.9 min)
> 7         892.23   886.59   897.86  1.84x    753.70s (12.6 min)
>

Thank you for sharing the performance test results!

While the benchmark results look good to me, have you compared the
performance differences between parallel vacuum in the VACUUM command
(with the PARALLEL option) and parallel vacuum in autovacuum? Since
parallel autovacuum introduces some logic to check for delay parameter
updates, I thought it was worth verifying if this adds any overhead.

BTW, in my view, the most challenging part of this patch is the
propagation logic for vacuum delay parameters. This propagation is
necessary because, unlike manual VACUUM, autovacuum workers can reload
their configuration during operation.
We must ensure that parallel workers stay synchronized with these
updated parameters.

The current patch implements this in vacuumparallel.c: the leader
shares delay parameters in DSM and updates them (if any vacuum delay
parameters are updated) after a config reload, while workers poll for
updates at every vacuum_delay_point() call to refresh their local
variables. Another possible approach would be an event-driven model
where the leader notifies workers after updating shared parameters,
for example by adding a shm_mq between the leader (as the sender) and
each worker (as the receiver).

I've compared these two ideas and opted for the former (polling).
While a polling approach could theoretically be costly, the current
implementation is self-contained within the parallel vacuum logic and
does not touch the core parallel query infrastructure. The
notification approach might look more elegant, but I'm concerned it
adds unnecessary complexity just for the autovacuum case. Since the
polling is essentially just checking an atomic variable, the overhead
should be negligible.

To verify this, I conducted benchmarks comparing the whole execution
time and index vacuuming duration.

Setup:
- Disabled (auto) vacuum delays and buffer usage limits.
- Parallel autovacuum with 1 worker on a table with 2 indexes (approx.
  4 GB each).
- 5 runs.

Case 1: The latest patch (with polling)
Average: 3.95s (Index: 1.54s)
Median: 3.62s (Index: 1.37s)

Case 2: The latest patch without polling
Average: 3.98s (Index: 1.56s)
Median: 3.70s (Index: 1.40s)

Note that in order to simulate the code that doesn't have the polling,
I reverted the following change:

-   if (InterruptPending ||
-       (!VacuumCostActive && !ConfigReloadPending))
+   if (InterruptPending)
+       return;
+
+   if (IsParallelWorker())
+   {
+       /*
+        * Update cost-based vacuum delay parameters for a parallel autovacuum
+        * worker if any changes are detected.
+        */
+       parallel_vacuum_update_shared_delay_params();
+   }
+
+   if (!VacuumCostActive && !ConfigReloadPending)

With this change reverted, the parallel vacuum workers don't check the
shared vacuum delay parameters at all, which is still fine because I
disabled vacuum delays for these runs.

Overall, the results show no noticeable overhead from the polling
approach: the averages and medians of the two cases differ by less
than 0.1s.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com