From: Robert Leach <rleach@princeton.edu>
Message-Id: <0FE9C709-A108-4ED5-8132-B802B8D9908F@princeton.edu>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_ABF3AA76-48F4-416D-ABFA-86E39773A17A"
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.500.171.1.1\))
Subject: Re: How to perform a long running dry run transaction without
 blocking
Date: Thu, 6 Feb 2025 12:58:08 -0500
In-Reply-To: <6d833658-f461-4ad4-a3e1-86d3c515bc18@aklaver.com>
Cc: pgsql-general <pgsql-general@postgresql.org>
To: Adrian Klaver <adrian.klaver@aklaver.com>
References: <BD62E056-3F3B-4CC0-A8CA-E5B7B9CB35CA@princeton.edu>
 <88d60ace-45e6-4d41-afc4-113df7219c4d@aklaver.com>
 <4000D0EE-B250-4E9E-831F-00C034D6D0B5@princeton.edu>
 <6d833658-f461-4ad4-a3e1-86d3c515bc18@aklaver.com>
Archived-At: <https://www.postgresql.org/message-id/0FE9C709-A108-4ED5-8132-B802B8D9908F%40princeton.edu>
Precedence: bulk


--Apple-Mail=_ABF3AA76-48F4-416D-ABFA-86E39773A17A
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

> Please reply to list also.
> Ccing list

=F0=9F=91=8D=F0=9F=8F=BB

>> They enter the data in an excel spreadsheet containing about a dozen =
inter-related worksheets, named, for example: Study, Animals, Samples, =
Tissues, Treatments, Sequences
>=20
> Long term is there a thought to have them enter directly into database =
where validation can happen in real time

No, not really.  Sample metadata tends to be an afterthought to =
researchers.  They have it in their notebooks and getting them to enter =
it at all is like pulling teeth.  The validation interface actually has =
a bunch of other features I haven't mentioned that streamline the =
process for them.  Before it gets to actually validating the data, it =
tries to lighten the manual burden on the researchers (and help with =
consistent nomenclature) by pulling sample names out of the raw files, =
massaging them, and filling those in along with a mass of common data =
that is used to populate drop-downs in the excel columns to avoid =
researcher typos and value variants.

Having everything work with excel actually made the site more attractive =
to the researchers, because they're comfortable with it and use it =
already, so it lowered the bar for using our software.

Besides, we don't trust the users enough to enter data unsupervised.  =
There are a lot of aspects of the data that cannot be automatically =
validated and involve experimental parameters that are adjacent to the =
purpose of our site.  We have curators that need to look at everything =
to ensure consistency, and looking at all the data in context is =
necessary before any of it is entered.

That said, back in the aughts, I wrote a perl cgi site for a toxin and =
virulence factor database that used a web interface for data entry and =
achieved the curation goal by saving a form of all inter-related data.  =
The submit button sent that form to a list of curators who could approve =
the insert/update and make it actually happen.  I think I had actually =
suggested that form of data entry when this current project first =
started, but I was overruled.  However, in this project, the equivalent =
procedure would be per-sample, and you'd lose out on the overall =
context.  It's an interesting challenge, but I think we're pretty =
committed now on this file load path.

>>> Where are the background processes loading data to?
>> We first run our loads on a development server with a separate copy =
of the database, but then we re-run those same loads on the production =
server, where users perform their validations.
>=20
> The load to the development server does no validation?
>=20
> If so what is the purpose?
>=20
> The background processes are other validation runs?

It's the same code that executes in both cases (with or without the =
`--validate` flag).  All that that flag does is it (effectively) raises =
the dry run exception before it leaves the transaction block, so it =
always validates (whether the flag is supplied or not).

So the load doesn't fail until the end of the run, which is inefficient =
from a maintenance perspective.  I've been thinking of adding a =
`--failfast` option for use on the back end.  Haven't done it yet.  I =
started a load yesterday in fact that ran 2 hours before it buffered an =
exception related to a newly introduced bug.  I fixed the bug and ran =
the load again.  It finished sometime between COB yesterday and this =
morning (successfully!).

>> One of the thoughts I'd had to work around the problem was to somehow =
dump the data from the development load and load it onto production in =
some sort of scheduled downtime or something.  However, even if we do =
that, I'm concerned that multiple users concurrently validating =
different submissions would encounter this blocking issue, and since =
those validations can take (on the upper end) just under a minute, it's =
enough for at least 1 user to encounter a timeout.  I have not yet =
proven that can happen, but based on my reading of the postgres =
documentation, it seems logical.
>=20
> Seems you are looking for some sort of queuing system.
>=20
> What are the time constraints for getting the validation turned =
around.

I have considered a queuing system, though when I previously floated a =
proof of concept using celery, I was informed it was too much.  Though, =
at the time, all I was trying to do was a progress bar for a query stats =
feature.  So proposing celery in this instance may get more traction =
with the rest of the team.

Most of the small validation processes finish in under a dozen seconds.  =
The largest studies take just under a minute.  I have plans to optimize =
the loading scripts that hopefully could get the largest studies down to =
a dozen seconds.  If I could do that, and do the back end loads in =
off-peak hours, then I'd be willing to suffer the rare timeouts from =
concurrent validations.  The raw data loads will still likely take a =
much longer time.


Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544


--Apple-Mail=_ABF3AA76-48F4-416D-ABFA-86E39773A17A
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
	charset=utf-8

<html><head><meta http-equiv=3D"content-type" content=3D"text/html; =
charset=3Dutf-8"></head><body style=3D"overflow-wrap: break-word; =
-webkit-nbsp-mode: space; line-break: =
after-white-space;"><div><blockquote type=3D"cite"><div><div>Please =
reply to list also.<br>Ccing =
list<br></div></div></blockquote><div><br></div><div><span =
style=3D"font-size:-webkit-xxx-large">=F0=9F=91=8D=F0=9F=8F=BB</span><br><=
/div><br><blockquote type=3D"cite"><div><div><blockquote =
type=3D"cite">They enter the data in an excel spreadsheet containing =
about a dozen inter-related worksheets, named, for example: Study, =
Animals, Samples, Tissues, Treatments, Sequences =
<br></blockquote><br>Long term is there a thought to have them enter =
directly into database where validation can happen in real =
time<br></div></div></blockquote><div><br></div><div>No, not really. =
&nbsp;Sample metadata tends to be an afterthought to researchers. =
&nbsp;They have it in their notebooks and getting them to enter it at =
all is like pulling teeth. &nbsp;The validation interface actually has a =
bunch of other features I haven't mentioned that streamline the process =
for them. &nbsp;Before it gets to actually validating the data, it tries =
to lighten the manual burden on the researchers (and help with =
consistent nomenclature) by pulling sample names out of the raw files, =
massaging them, and filling those in along with a mass of common data =
that is used to populate drop-downs in the excel columns to avoid =
researcher typos and value variants.</div><div><br></div><div>Having =
everything work with excel actually made the site more attractive to the =
researchers, because they're comfortable with it and use it already, so =
it lowered the bar for using our =
software.</div><div><br></div><div>Besides, we don't trust the users =
enough to enter data unsupervised. &nbsp;There are a lot of aspects of =
the data that cannot be automatically validated and involve experimental =
parameters that are adjacent to the purpose of our site. &nbsp;We have =
curators that need to look at everything to ensure consistency, and =
looking at all the data in context is necessary before any of it is =
entered.</div><div><br></div><div>That said, back in the aughts, I wrote =
a perl cgi site for a toxin and virulence factor database that used a =
web interface for data entry and achieved the curation goal by saving a =
form of all inter-related data. &nbsp;The submit button sent that form =
to a list of curators who could approve the insert/update and make it =
actually happen. &nbsp;I think I had actually suggested that form of =
data entry when this current project first started, but I was overruled. =
&nbsp;However, in this project, the equivalent procedure would be =
per-sample, and you'd lose out on the overall context. &nbsp;It's an =
interesting challenge, but I think we're pretty committed now on this =
file load path.</div><br><blockquote type=3D"cite"><div><div><blockquote =
type=3D"cite"><blockquote type=3D"cite">Where are the background =
processes loading data to?<br></blockquote>We first run our loads on a =
development server with a separate copy of the database, but then we =
re-run those same loads on the production server, where users perform =
their validations.<br></blockquote><br>The load to the development =
server does no validation?<br><br>If so what is the purpose?<br><br>The =
background processes are other validation =
runs?<br></div></div></blockquote><div><br></div><div>It's the same code =
that executes in both cases (with or without the `--validate` flag). =
&nbsp;All that that flag does is it (effectively) raises the dry run =
exception before it leaves the transaction block, so it always validates =
(whether the flag is supplied or not).</div><div><br></div><div>So the =
load doesn't fail until the end of the run, which is inefficient from a =
maintenance perspective. &nbsp;I've been thinking of adding a =
`--failfast` option for use on the back end. &nbsp;Haven't done it yet. =
&nbsp;I started a load yesterday in fact that ran 2 hours before it =
buffered an exception related to a newly introduced bug. &nbsp;I fixed =
the bug and ran the load again. &nbsp;It finished sometime between COB =
yesterday and this morning (successfully!).</div><br><blockquote =
type=3D"cite"><div><div><blockquote type=3D"cite">One of the thoughts =
I'd had to work around the problem was to somehow dump the data from the =
development load and load it onto production in some sort of scheduled =
downtime or something. &nbsp;However, even if we do that, I'm concerned =
that multiple users concurrently validating different submissions would =
encounter this blocking issue, and since those validations can take (on =
the upper end) just under a minute, it's enough for at least 1 user to =
encounter a timeout. &nbsp;I have not yet proven that can happen, but =
based on my reading of the postgres documentation, it seems =
logical.<br></blockquote><br>Seems you are looking for some sort of =
queuing system.<br><br>What are the time constraints for getting the =
validation turned =
around.<br></div></div></blockquote><div><br></div><div>I have =
considered a queuing system, though when I previously floated a proof of =
concept using celery, I was informed it was too much. &nbsp;Though, at =
the time, all I was trying to do was a progress bar for a query stats =
feature. &nbsp;So proposing celery in this instance may get more =
traction with the rest of the team.</div><div><br></div><div>Most of the =
small validation processes finish in under a dozen seconds. &nbsp;The =
largest studies take just under a minute. &nbsp;I have plans to optimize =
the loading scripts that hopefully could get the largest studies down to =
a dozen seconds. &nbsp;If I could do that, and do the back end loads in =
off-peak hours, then I'd be willing to suffer the rare timeouts from =
concurrent validations. &nbsp;The raw data loads will still likely take =
a much longer time.</div><br><div><br></div><div>Robert William =
Leach</div><div>Research Software Engineer</div><div>133 Carl C. Icahn =
Lab</div><div>Lewis-Sigler Institute for Integrative =
Genomics</div><div>Princeton University</div><div>Princeton, NJ =
08544</div><div><br></div></div></body></html>=

--Apple-Mail=_ABF3AA76-48F4-416D-ABFA-86E39773A17A--