MIME-Version: 1.0
From: Vladimir Ryabtsev <greatvovan@gmail.com>
Date: Sat, 21 Dec 2024 02:45:45 -0800
Message-ID: 
 <CAMqTPqn8y8Y+uDY0FPvX6ghD1DftLyz2nD6n6HhGOg-gHP4JdA@mail.gmail.com>
Subject: Memory
To: psycopg@postgresql.org
Content-Type: multipart/alternative; boundary="000000000000e9aa4f0629c577c3"
Archived-At: 
 <https://www.postgresql.org/message-id/CAMqTPqn8y8Y%2BuDY0FPvX6ghD1DftLyz2nD6n6HhGOg-gHP4JdA%40mail.gmail.com>
Precedence: bulk

--000000000000e9aa4f0629c577c3
Content-Type: text/plain; charset="UTF-8"

Hi community,

I am reading a big dataset using code similar to this:

query = '''
SELECT timestamp, data_source, tag, agg_value
FROM my_table
'''
batch_size = 10_000_000

with psycopg.connect(cs, cursor_factory=psycopg.ClientCursor) as conn:
  with conn.cursor('my_table') as cur:
    cur = cur.execute(query)
    while True:
      rows = cur.fetchmany(batch_size)
      # ...
      if not rows:
        break

The code is executed on a Databricks node, if that matters. The library
version is the latest.

I found that despite fetching in batches, memory consumption grows
continuously throughout the loop iterations and eventually the node goes
OOM. My code does not save any references, so it might be something
internal to the library.

If I change the factory to ServerCursor, the issue goes away: memory does
not grow after the first iteration.
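
For reference, a minimal sketch of the ServerCursor variant, assuming
psycopg 3, where passing a name to conn.cursor() requests a server-side
cursor. The function names iter_batches and count_rows_server_side, the
cursor name "my_table_stream" and the smaller default batch size are
hypothetical, chosen only for illustration; this is not my exact code.

```python
from typing import Any, Iterator, Sequence


def iter_batches(cur: Any, batch_size: int) -> Iterator[Sequence[Any]]:
    """Yield batches of rows from any cursor exposing fetchmany()."""
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            # An empty batch means the result set is exhausted.
            return
        yield rows


def count_rows_server_side(conninfo: str, query: str,
                           batch_size: int = 100_000) -> int:
    """Stream `query` through a named (server-side) cursor, counting rows."""
    import psycopg  # imported lazily; iter_batches itself does not need it

    total = 0
    with psycopg.connect(conninfo) as conn:
        # Assumption: a named cursor keeps the result set on the server,
        # so each fetchmany() call transfers only one batch.
        with conn.cursor(name="my_table_stream") as cur:
            cur.execute(query)
            for rows in iter_batches(cur, batch_size):
                total += len(rows)  # stand-in for the real processing
    return total
```

Each batch becomes garbage as soon as the loop moves to the next one, so
only about one batch of rows should be held by my code at a time.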

I looked through the documentation, but did not find specifics on
performance differences between Server and Client cursors.

I am fine with ServerCursor, but I need to ask: is it by design that with
ClientCursor the whole result set is copied into memory despite the
fetchmany() limit? Client-side cursors are the default, so it may be worth
documenting the difference (sorry if I missed that).

Thank you.


--000000000000e9aa4f0629c577c3--