MIME-Version: 1.0
References: <cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com>
In-Reply-To: <cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com>
From: Gregory Smith <gregsmithpgsql@gmail.com>
Date: Fri, 9 Jun 2023 17:19:47 -0400
Message-ID: <CAHLJuCXz62q1bpqzKNCNSzOg2LJ1c-UVpcCgxcxuNiaPykVp3w@mail.gmail.com>
Subject: Re: index prefetching
To: Tomas Vondra <tomas.vondra@enterprisedb.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, 
	Georgios <gkokolatos@protonmail.com>
Content-Type: multipart/alternative; boundary="0000000000005c0f0705fdb8eeef"
Archived-At: <https://www.postgresql.org/message-id/CAHLJuCXz62q1bpqzKNCNSzOg2LJ1c-UVpcCgxcxuNiaPykVp3w%40mail.gmail.com>
Precedence: bulk

--0000000000005c0f0705fdb8eeef
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thu, Jun 8, 2023 at 11:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans


At the point Greg Stark was hacking on this, the underlying OS async I/O
features were tricky to fit into PG's I/O model, and both of us did a lot
of review work just to find common ground that PG could plug into.
Linux's POSIX advisories were completely different from Solaris's async
model, the other OS used to validate that the feature worked, the hope
being that designing against two APIs would turn out better than focusing
on Linux alone.  Since that foundation was so brittle and limited, scope
was kept to just the bitmap heap scan, which seemed to offer the best
return on time invested given which parts of async I/O did and didn't
scale as expected.

As I remember it, the idea was to get the basic feature out the door and
gather feedback on things like whether the effective_io_concurrency knob
worked as expected before moving on to other prefetching.  Then that got
lost in filesystem upheaval land, with so much drama around Solaris/ZFS and
Oracle's btrfs work.  I think it's just that no one ever got back to it.

I have all the workloads that I use for testing automated into
pgbench-tools now, and this change would be easy to fit into testing on
them as I'm very heavy on block I/O tests.  To get PG to reach full read
speed on newer storage I've had to do some strange tests, like doing index
range scans that touch 25+ pages.  Here's that one as a pgbench script:

\set range 67 * (:multiplier + 1)
\set limit 100000 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
SELECT aid,abalance FROM pgbench_accounts WHERE aid >= :aid ORDER BY aid
LIMIT :range;

And then you use '-Dmultiplier=10' or such to crank it up.  A database 4X
the size of RAM, multiplier=25, and 16 clients is my starting point when I
want to saturate storage.  Anything that lets me bring those numbers down
would be valuable.
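
To make the arithmetic in that script concrete, here is a rough Python
sketch of what the \set lines work out to.  This is only an illustration:
the function name range_scan_params is made up for this example and is not
part of pgbench or pgbench-tools, and scale=1000 is an arbitrary value
picked for the printout.

```python
def range_scan_params(multiplier, scale):
    """Mirror the three arithmetic lines of the pgbench script.

    Hypothetical helper for illustration only.
    """
    # \set range 67 * (:multiplier + 1)
    rows = 67 * (multiplier + 1)
    # \set limit 100000 * :scale, then \set limit :limit - :range
    # Subtracting the range keeps the scan from running off the end
    # of pgbench_accounts.
    limit = 100000 * scale - rows
    return rows, limit


# scale=1000 is an assumed example value, not one from the script.
for m in (0, 10, 25):
    rows, limit = range_scan_params(m, scale=1000)
    print(f"multiplier={m}: {rows} rows per query, aid drawn from 1..{limit}")
```

So multiplier=25 means each query returns 67 * 26 = 1742 rows.  One thing
to watch: when pgbench runs only a custom -f script it leaves :scale at 1
unless -s is passed, so the limit above is only right if -s matches the
scale the database was initialized with.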

--
Greg Smith  greg.smith@crunchydata.com
Director of Open Source Strategy

--0000000000005c0f0705fdb8eeef--