Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1q7jWe-0000uQ-JR for pgsql-hackers@arkaria.postgresql.org; Fri, 09 Jun 2023 21:20:04 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.92) (envelope-from ) id 1q7jWd-0005Hv-3W for pgsql-hackers@arkaria.postgresql.org; Fri, 09 Jun 2023 21:20:03 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1q7jWc-0005Fv-Mh for pgsql-hackers@lists.postgresql.org; Fri, 09 Jun 2023 21:20:02 +0000 Received: from mail-oi1-x233.google.com ([2607:f8b0:4864:20::233]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1q7jWa-001Hnn-3w for pgsql-hackers@lists.postgresql.org; Fri, 09 Jun 2023 21:20:02 +0000 Received: by mail-oi1-x233.google.com with SMTP id 5614622812f47-39a3f2668d5so975001b6e.1 for ; Fri, 09 Jun 2023 14:19:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1686345598; x=1688937598; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=ID3EIX29NNc+CFEEwuEUsMUVssR/mTobK5vZMB54CM4=; b=pKOHxaXoV8+S8URPwQgwMD8yW85KYCKOhjrwDwAO+JNYnVZ7Ylqt5anbihfpZdFkDc spR92NPbNsKNkyVbsGcxwtQiB/Aln9xdZpMUTm4QEAumIlYCiaoGSN+gp2tY+Yq5T/6k n+pAcEquUCXeKq7kNPXi+6JU6x+F63exSs8FTH05oqsEAS+XaxG9XteIdXaNwlAzsjml 8M0Y+60DZZ13/zdM9uX/L7JcgwGgqxO0OQxmqezj72wqDrphGgK3Zcq/ZgLF60VRedwO H+ORuQs/mNmnAZbq1qmzAeZPG+oiIO4/jUJM6+N4tYlnfcl0p7qP8/YDMY0G2JGnwvYT UiWQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1686345598; x=1688937598; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=ID3EIX29NNc+CFEEwuEUsMUVssR/mTobK5vZMB54CM4=; b=VEkkQhwGcHzL6IDhPkErDL0ftYMGslSHobGcyGw4ilQTIK78qst5cutxA9z9qmQfBI MJmcK3LRzc06wgE4B2+9a08+AMli73+PF2FjJGani1j5kWiT/GzBmcrYOMO6X464P/yD VHaaBptDW4SOhVGcI+xejZAen5qWOW8Ce+2o0tJha9AEptKoQnjNm0EoOeZK5zxv3k5V Vz5xrnJSGwWS4Z2JZfGR6nZmZwhvUYsm6ReWsW2iZ0tQ6e2gC/2iUBHbvROz2OK2Pprq ArIVdlSO6CQrmDfXi4Xut5bhRQmGAg5KPGmAFUzkyhy76S+ItRrJxvpWGBA7iJtD9JNd RBGA== X-Gm-Message-State: AC+VfDzMV50wLlQ8x8agR05pXCdxTba3EBueqiPg0iqtyUHCA3KymBXC 9tffcA9eCshC/rm6l297KXk7m+VV1r5iiPdSyLQ= X-Google-Smtp-Source: ACHHUZ6u4n6jkpDSi1Q9yl9SBbn7HVTHQVw76KlRiAf/mX/c20QMkfxyhzIs1dvfHTTXBRHxEy/RfJI388qLXuiwA3k= X-Received: by 2002:a05:6808:618:b0:39a:a9e6:ae9 with SMTP id y24-20020a056808061800b0039aa9e60ae9mr2236981oih.57.1686345598372; Fri, 09 Jun 2023 14:19:58 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Gregory Smith Date: Fri, 9 Jun 2023 17:19:47 -0400 Message-ID: Subject: Re: index prefetching To: Tomas Vondra Cc: PostgreSQL Hackers , Georgios Content-Type: multipart/alternative; boundary="0000000000005c0f0705fdb8eeef" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000005c0f0705fdb8eeef Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, Jun 8, 2023 at 11:40=E2=80=AFAM Tomas Vondra wrote: > We already do prefetching for bitmap index scans, where the bitmap heap > scan prefetches future pages based on effective_io_concurrency. I'm not > sure why exactly was prefetching implemented only for bitmap scans At the point Greg Stark was hacking on this, the underlying OS async I/O features were tricky to fix into PG's I/O model, and both of us did much review work just to find working common ground that PG could plug into. Linux POSIX advisories were completely different from Solaris's async model, the other OS used for validation that the feature worked, with the hope being that designing against two APIs would be better than just focusing on Linux. Since that foundation was all so brittle and limited, scope was limited to just the heap scan, since it seemed to have the best return on time invested given the parts of async I/O that did and didn't scale as expected. As I remember it, the idea was to get the basic feature out the door and gather feedback about things like whether the effective_io_concurrency knob worked as expected before moving onto other prefetching. Then that got lost in filesystem upheaval land, with so much drama around Solaris/ZFS and Oracle's btrfs work. I think it's just that no one ever got back to it. I have all the workloads that I use for testing automated into pgbench-tools now, and this change would be easy to fit into testing on them as I'm very heavy on block I/O tests. To get PG to reach full read speed on newer storage I've had to do some strange tests, like doing index range scans that touch 25+ pages. Here's that one as a pgbench script: \set range 67 * (:multiplier + 1) \set limit 100000 * :scale \set limit :limit - :range \set aid random(1, :limit) SELECT aid,abalance FROM pgbench_accounts WHERE aid >=3D :aid ORDER BY aid LIMIT :range; And then you use '-Dmultiplier=3D10' or such to crank it up. Database 4X RAM, multiplier=3D25 with 16 clients is my starting point on it when I want to saturate storage. Anything that lets me bring those numbers down would be valuable. -- Greg Smith greg.smith@crunchydata.com Director of Open Source Strategy --0000000000005c0f0705fdb8eeef Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Thu, Jun 8, 2023 at 11:40=E2=80=AFAM T= omas Vondra <tomas.vond= ra@enterprisedb.com> wrote:
We already do prefetching for bit= map index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not=
sure why exactly was prefetching implemented only for bitmap scans

At the point Greg Stark was hacking on this, the un= derlying OS async I/O features were tricky to fix into PG's I/O model, = and both of us did much review work just to find working common ground that= PG could plug into.=C2=A0 Linux POSIX advisories were completely different= from Solaris's async model, the other OS used for validation that the = feature worked, with the hope being that designing against two APIs would b= e better than just focusing on Linux.=C2=A0 Since that foundation was all s= o brittle and limited, scope was limited to just the heap scan, since it se= emed to have the best return on time invested given the parts of async I/O = that did and didn't scale as expected.

As= I remember it, the idea was to get the basic feature out the door and gath= er feedback about things like whether the effective_io_concurrency knob wor= ked as expected before moving onto other prefetching.=C2=A0 Then that got l= ost in filesystem upheaval land, with so much drama around Solaris/ZFS and = Oracle's btrfs work.=C2=A0 I think it's just that no one ever got b= ack to it.

I have all the workloads that I use for= testing automated into pgbench-tools now, and this change would be easy to= fit into testing on them as I'm very heavy on block I/O tests.=C2=A0 T= o get PG to reach full read speed on newer storage I've had to do some = strange tests, like doing index range scans that touch 25+ pages.=C2=A0 Her= e's that one as a pgbench script:

\set range <= span class=3D"gmail-pl-c1">67 * (:= multiplier + = 1)
\set limit= 100000 * :scale
\set limit= :limit - :range
\set aid random(= 1, :limit)
SELECT aid,abal= ance FROM pgbench_accounts WHERE aid >=3D :ai= d ORDER BY aid LIMIT :range;

And then you use '= ;-Dmultiplier=3D10' or such to crank it up.=C2=A0 Database 4X RAM, mult= iplier=3D25 with 16 clients is my starting point on it when I want to satur= ate storage.=C2=A0 Anything that lets me bring those numbers down would be = valuable.

--
Greg Smith=C2=A0 greg.smith@crunchydata.com
Director= of Open Source Strategy
--0000000000005c0f0705fdb8eeef--