public inbox for [email protected]  
help / color / mirror / Atom feed
From: Dominique Devienne <[email protected]>
To: [email protected]
Subject: Advice on efficiently logging outputs to PostgreSQL
Date: Tue, 15 Oct 2024 16:17:04 +0200
Message-ID: <CAFCRh-9nu9LXb6+p_pp2iXRxp4mrfJz-2S_cormOo7X-ZkBv=g@mail.gmail.com> (raw)

I have an existing heavy ETL that serially loads tons of data to PostgreSQL.
This is done using a CLI tool, processing one project after another.

I'd like to parallelize / distribute the work, which I could do from my CLI
tool, but 1) that would be confined to a single machine, and 2) we'd like to
provide a web UI to report progress and logs from the ETL workers / processes.

I mentioned in the past I have a basic LISTEN-NOTIFY-based work queue
in PostgreSQL, that would do nicely for this, but I'm wondering how to best
to deal with the logs.

The ETL process is quite chatty, and can generate quite a bit of logs.
Which for the Web UI to see (developped independently, by a separate team),
I'd want in PostgreSQL as well. The ETL worker wants to write them, almost
continuously. The Web UI wants to read them concurrently, to show them,
and ideally in a timely fashion.

My experience with CI tools like Jenkins is that logs are "laggy", and
arrive in big chunk, dozens of seconds appart, which is a subpar experience.
That's not PostgreSQL related, but just to illustrate what I'd like to avoid.

Has anyone done anything along these lines?

Given the way MVCC works in PostgreSQL, updating (by appending to) a
"file-like" bytea or text[] seems like a bad idea. So each log line should be
it's own row? Or lines with close timestamps aggregated together, to limit
rows generated?

If granular at the line level, will tons of small transactions be a problem?
And blow-up our "Oid budget" too rapidly?

Am I worrying too much? :)

I'd appreciate any advise or experience-based story around this
use-case, please.

Thanks, --DD






reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Advice on efficiently logging outputs to PostgreSQL
  In-Reply-To: <CAFCRh-9nu9LXb6+p_pp2iXRxp4mrfJz-2S_cormOo7X-ZkBv=g@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox