public inbox for [email protected]  
help / color / mirror / Atom feed
Memory
3+ messages / 2 participants
[nested] [flat]

* Memory
@ 2024-12-21 10:45  Vladimir Ryabtsev <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Vladimir Ryabtsev @ 2024-12-21 10:45 UTC (permalink / raw)
  To: psycopg

Hi community,

I am reading a big dataset using code similar to this:

query = '''
SELECT timestamp, data_source, tag, agg_value
FROM my_table
'''I
batch_size = 10_000_000

with psycopg.connect(cs, cursor_factory=psycopg.ClientrCursor) as conn:
  with conn.cursor('my_table') as cur:
    cur = cur.execute(query)
    while True:
      rows = cur.fetchmany(batch_size)
      # ...
      if not rows:
        break

The code is executed on a Databricks node, if that matters. The library
version is the latest.

I found that despite fetching in batches, memory consumption grows
continuously throughout the loop iterations and eventually the node goes
OOM. My code does not save any references, so it might be something
internal to the library.

If I change the factory to ServerCursor, the issue fixes, memory does not
grow after the first iteration.

I looked the documentation, but did not find specifics related to
performance differences between Server and Client cursors.

I am fine with ServerCursor, but I need to ask, is it by design that with
ClientCursor the result set is copied into memory despite fetchmany()
limit? ClientCursor is the default class, so may be worth documenting the
difference (sorry, if I missed that).

Thank you.


^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Memory
@ 2024-12-21 16:34  Adrian Klaver <[email protected]>
  parent: Vladimir Ryabtsev <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Adrian Klaver @ 2024-12-21 16:34 UTC (permalink / raw)
  To: Vladimir Ryabtsev <[email protected]>; psycopg

On 12/21/24 02:45, Vladimir Ryabtsev wrote:
> Hi community,
> 
> I am reading a big dataset using code similar to this:
> 
> query = '''
> SELECT timestamp, data_source, tag, agg_value
> FROM my_table
> '''I
> batch_size = 10_000_000
> 
> with psycopg.connect(cs, cursor_factory=psycopg.ClientrCursor) as conn:
  																
FYI, ClientCursor.


> 
> I looked the documentation, but did not find specifics related to 
> performance differences between Server and Client cursors.
> 
> I am fine with ServerCursor, but I need to ask, is it by design that 
> with ClientCursor the result set is copied into memory despite 
> fetchmany() limit? ClientCursor is the default class, so may be worth 
> documenting the difference (sorry, if I missed that).

Client side cursor

https://www.psycopg.org/psycopg3/docs/advanced/cursors.html#client-side-cursors

"In such querying pattern, after a cursor sends a query to the server 
(usually calling execute()), the server replies transferring to the 
client the whole set of results requested, which is stored in the state 
of the same cursor and from where it can be read from Python code (using 
methods such as fetchone() and siblings)."

https://www.psycopg.org/psycopg3/docs/api/cursors.html#psycopg.Cursor.fetchmany

"fetchmany(size: int = 0) → list[+Row]

     Return the next size records from the current recordset.

     size default to self.arraysize if not specified.

     Return type:

         Sequence[Row], with Row defined by row_factory

"


Server side cursor

https://www.psycopg.org/psycopg3/docs/advanced/cursors.html#server-side-cursors

"PostgreSQL has its own concept of cursor too (sometimes also called 
portal). When a database cursor is created, the query is not necessarily 
completely processed: the server might be able to produce results only 
as they are needed. Only the results requested are transmitted to the 
client: if the query result is very large but the client only needs the 
first few records it is possible to transmit only them.

The downside is that the server needs to keep track of the partially 
processed results, so it uses more memory and resources on the server."

> 
> Thank you.
> 

-- 
Adrian Klaver
[email protected]






^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Memory
@ 2024-12-21 21:52  Vladimir Ryabtsev <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Vladimir Ryabtsev @ 2024-12-21 21:52 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: psycopg

Now I see, the doc is already great.
Thanks for pointing out Adrian.


On Sat, Dec 21, 2024 at 8:34 AM Adrian Klaver <[email protected]>
wrote:

> On 12/21/24 02:45, Vladimir Ryabtsev wrote:
> > Hi community,
> >
> > I am reading a big dataset using code similar to this:
> >
> > query = '''
> > SELECT timestamp, data_source, tag, agg_value
> > FROM my_table
> > '''I
> > batch_size = 10_000_000
> >
> > with psycopg.connect(cs, cursor_factory=psycopg.ClientrCursor) as conn:
>
> FYI, ClientCursor.
>
>
> >
> > I looked the documentation, but did not find specifics related to
> > performance differences between Server and Client cursors.
> >
> > I am fine with ServerCursor, but I need to ask, is it by design that
> > with ClientCursor the result set is copied into memory despite
> > fetchmany() limit? ClientCursor is the default class, so may be worth
> > documenting the difference (sorry, if I missed that).
>
> Client side cursor
>
>
> https://www.psycopg.org/psycopg3/docs/advanced/cursors.html#client-side-cursors
>
> "In such querying pattern, after a cursor sends a query to the server
> (usually calling execute()), the server replies transferring to the
> client the whole set of results requested, which is stored in the state
> of the same cursor and from where it can be read from Python code (using
> methods such as fetchone() and siblings)."
>
>
> https://www.psycopg.org/psycopg3/docs/api/cursors.html#psycopg.Cursor.fetchmany
>
> "fetchmany(size: int = 0) → list[+Row]
>
>      Return the next size records from the current recordset.
>
>      size default to self.arraysize if not specified.
>
>      Return type:
>
>          Sequence[Row], with Row defined by row_factory
>
> "
>
>
> Server side cursor
>
>
> https://www.psycopg.org/psycopg3/docs/advanced/cursors.html#server-side-cursors
>
> "PostgreSQL has its own concept of cursor too (sometimes also called
> portal). When a database cursor is created, the query is not necessarily
> completely processed: the server might be able to produce results only
> as they are needed. Only the results requested are transmitted to the
> client: if the query result is very large but the client only needs the
> first few records it is possible to transmit only them.
>
> The downside is that the server needs to keep track of the partially
> processed results, so it uses more memory and resources on the server."
>
> >
> > Thank you.
> >
>
> --
> Adrian Klaver
> [email protected]
>
>


^ permalink  raw  reply  [nested|flat] 3+ messages in thread


end of thread, other threads:[~2024-12-21 21:52 UTC | newest]

Thread overview: 3+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-12-21 10:45 Memory Vladimir Ryabtsev <[email protected]>
2024-12-21 16:34 ` Adrian Klaver <[email protected]>
2024-12-21 21:52   ` Vladimir Ryabtsev <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox