From: Robert Leach <rleach@princeton.edu>
Message-Id: <00DF52D1-5ED0-4599-8480-8C671434EE4E@princeton.edu>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_82E64062-A7A9-432F-B389-649CAA62999F"
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.500.171.1.1\))
Subject: Re: How to perform a long running dry run transaction without
 blocking
Date: Thu, 6 Feb 2025 15:08:26 -0500
In-Reply-To: <b89a7132-7fa5-4229-a03c-20f5d1e11cf4@aklaver.com>
Cc: pgsql-general <pgsql-general@postgresql.org>
To: Adrian Klaver <adrian.klaver@aklaver.com>
References: <BD62E056-3F3B-4CC0-A8CA-E5B7B9CB35CA@princeton.edu>
 <88d60ace-45e6-4d41-afc4-113df7219c4d@aklaver.com>
 <4000D0EE-B250-4E9E-831F-00C034D6D0B5@princeton.edu>
 <6d833658-f461-4ad4-a3e1-86d3c515bc18@aklaver.com>
 <0FE9C709-A108-4ED5-8132-B802B8D9908F@princeton.edu>
 <b89a7132-7fa5-4229-a03c-20f5d1e11cf4@aklaver.com>
Archived-At: <https://www.postgresql.org/message-id/00DF52D1-5ED0-4599-8480-8C671434EE4E%40princeton.edu>
Precedence: bulk


--Apple-Mail=_82E64062-A7A9-432F-B389-649CAA62999F
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii

>>> The load to the development server does no validation?
>>>
>>> If so what is the purpose?
>>>
>>> The background processes are other validation runs?
>> It's the same code that executes in both cases (with or without the
>> `--validate` flag).  All that flag does is (effectively) raise the
>> dry run exception before it leaves the transaction block, so it
>> always validates (whether the flag is supplied or not).
>
> More for my sake than anything else, why do the load to the
> development server at all if the production load is the only one that
> counts?

The software is still in a new major version beta.  We're adding
features and fixing bugs.  It's not unusual to encounter a new bug, fix
it on dev to get the load to work, then deploy a point release on prod.
That means repeated load attempts that interfere with the validation
interface.  Beyond that, we're planning a separate staging database,
which is effectively the role dev plays now.  Sometimes a curator only
finds a technical data issue after the initial load, while browsing the
newly loaded data on the dev site.

>> So the load doesn't fail until the end of the run, which is
>> inefficient from a maintenance perspective.  I've been thinking of
>> adding a `--failfast` option for use on the back end.  Haven't done
>> it yet.  I started a load yesterday in fact that ran 2 hours before
>> it buffered an exception related to a newly introduced bug.  I fixed
>> the bug and ran the load again.  It finished sometime between COB
>> yesterday and this morning (successfully!).
>
> Alright I am trying to reconcile this with the statement from below,
> 'The largest studies take just under a minute'.

The context of the 'The largest studies take just under a minute'
statement is that validation does not load the hefty, time-consuming
raw data.  It only validates the metadata, which is fast (5-60s).  That
metadata is only a portion of the transaction in the back-end load.
Because validation never touches the raw data, it can miss some errors,
and those errors are addressed by curators editing the Excel sheets.
That's why everything is in one load transaction instead of being
loaded separately.  Those problems are somewhat rare, though (and we
currently have a new feature in the design phase that should almost
completely eliminate them).

>>> Seems you are looking for some sort of queuing system.
>>>
>>> What are the time constraints for getting the validation turned
>>> around.
>> I have considered a queuing system, though when I previously floated
>> a proof of concept using celery, I was informed it was too much.
>> Though, at the time, all I was trying to do was a progress bar for a
>> query stats feature.  So proposing celery in this instance may get
>> more traction with the rest of the team.
>> Most of the small validation processes finish in under a dozen
>> seconds.  The largest studies take just under a minute.  I have plans
>> to optimize the loading scripts that hopefully could get the largest
>> studies down to a dozen seconds.  If I could do that, and do the back
>> end loads in off-peak hours, then I'd be willing to suffer the rare
>> timeouts from concurrent validations.  The raw data loads will still
>> likely take a much longer time.
>
> This is where I get confused, probably because I am not exactly sure
> what constitutes validation. My sense is that involves a load of data
> into live tables and seeing what fails PK, FK or other constraints.
>
> If that is the case I am not seeing how the 'for real' data load would
> be longer?

The validation skips the time-consuming raw data load.  That raw data
is collectively hundreds of gigabytes in size and could not be uploaded
on the validation page anyway.  The feature I alluded to above, which
should almost completely eliminate the errors associated with the raw
data, is one where the researcher can drop the raw data folder into the
form and it just walks the directory to collect all the raw data file
names and relative paths.  It's the validation of those data
relationships that is currently skipped.

> At any rate I can't see how loading into a live database multiple sets
> of data while operations are going on in the database can be made
> conflict free. To me it seems the best that can be done is:
>
> 1) Reduce chance for conflict by spreading the actions out.
>
> 2) Have retry logic that deals with conflicts.

I'm unfamiliar with retry functionality, but those options sound like a
logical path forward to me, particularly using celery to spread out
validations and doing the back end loads at night (or using some sort
of fast dump/load).  The thing that bothers me about the celery
solution is that most of the time, 2 users validating different data
will not block each other, so I would be making users wait for no
reason.  Ideally, I could anticipate the block and separate those
validations only at that point.
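
To make the retry idea concrete for myself, here is a minimal sketch of
what I understand "retry logic" to mean: run the validation, and if it
fails on a conflict, wait a bit and try again.  Everything named here is
hypothetical (`LockConflict`, `run_with_retry`, and all of the
parameters); it is not code from our loader, just an illustration of
retrying with exponential backoff and jitter.

```python
import random
import time


class LockConflict(Exception):
    """Hypothetical stand-in for whatever error the database raises
    when a validation cannot get the locks it needs in time."""


def run_with_retry(action, retry_on=(LockConflict,), attempts=5,
                   base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call action(); on a conflict error, back off and try again.

    The delay doubles after each failed attempt (capped at max_delay)
    and is randomly shortened so that two blocked validations do not
    keep retrying in lockstep.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

If I went this way, I assume I would pass Django's
`django.db.OperationalError` as `retry_on` and set a short
`lock_timeout` inside the validation transaction, so that a blocked
validation fails quickly instead of hanging until the web request times
out.  I have not tried that combination, so treat it as a guess.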

This brings up a question though about a possibility I suspect is not
practical.  My initial read of the isolation levels documentation found
this section really promising:

> The Repeatable Read isolation level only sees data committed before
> the transaction began; it never sees either uncommitted data or
> changes committed during transaction execution by concurrent
> transactions.

This was before I realized that the actions of the previously started
transaction would include "locks" that would block validation even
though the load transaction hasn't committed yet:

> a target row might have already been updated (or deleted or locked)
> by another concurrent transaction by the time it is found. In this
> case, the repeatable read transaction will wait for the first updating
> transaction to commit or roll back

Other documentation I read referred to the state of the DB (when a
transaction starts) as a "snapshot", and I thought... what if I could
save such a snapshot automatically just before a back-end load starts
and have my validation processes validate against it, so that they do
not encounter any locks?  The validation will never commit, so there's
no risk.

I know Django's ORM wouldn't support that, but I kind of hoped that
someone on this email list might suggest a snapshot functionality as a
possible solution.  Since the validations never commit, the only
downside would be if the back-end load changed something that
introduces a problem with the validated data, which would not be
caught until we actually attempt to load it.
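
A rough sketch of what such a snapshot hand-off could look like,
assuming PostgreSQL's documented `pg_export_snapshot()` function and
`SET TRANSACTION SNAPSHOT` command.  The helper names
(`export_snapshot`, `begin_with_snapshot`) are hypothetical, and I am
assuming a raw DB-API cursor on an autocommit connection with
psycopg2-style `%s` parameter interpolation, not the Django ORM.

```python
def export_snapshot(cursor):
    """Open a transaction on a "holder" connection and export its snapshot.

    Returns the snapshot identifier.  The identifier can only be
    imported while this holder transaction remains open.
    """
    cursor.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
    cursor.execute("SELECT pg_export_snapshot()")
    return cursor.fetchone()[0]


def begin_with_snapshot(cursor, snapshot_id):
    """Start a validation transaction that sees the exported snapshot.

    SET TRANSACTION SNAPSHOT has to be the first statement of a
    REPEATABLE READ (or SERIALIZABLE) transaction, before any query.
    """
    cursor.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
    cursor.execute("SET TRANSACTION SNAPSHOT %s", (snapshot_id,))
```

The holder connection would call `export_snapshot()` just before the
back-end load starts and would have to stay open for as long as
validations need the snapshot; each validation would call
`begin_with_snapshot()` and roll back at the end.  What I can't tell is
whether this helps at all: as far as I understand, a snapshot only
controls which rows a transaction sees, so the writes my validation
performs could still end up waiting on rows or keys the load has locked.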

Is that too science-fictiony of an idea?

--Apple-Mail=_82E64062-A7A9-432F-B389-649CAA62999F--