Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tg8B2-007xj8-4n for pgsql-general@arkaria.postgresql.org; Thu, 06 Feb 2025 20:08:44 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tg8B0-00BHoh-AE for pgsql-general@arkaria.postgresql.org; Thu, 06 Feb 2025 20:08:42 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tg8Az-00BHoS-Na for pgsql-general@lists.postgresql.org; Thu, 06 Feb 2025 20:08:41 +0000 Received: from mail-qt1-x832.google.com ([2607:f8b0:4864:20::832]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tg8Ax-0045nc-0B for pgsql-general@postgresql.org; Thu, 06 Feb 2025 20:08:41 +0000 Received: by mail-qt1-x832.google.com with SMTP id d75a77b69052e-467a17055e6so16762321cf.3 for ; Thu, 06 Feb 2025 12:08:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=princeton.edu; s=google; t=1738872518; x=1739477318; darn=postgresql.org; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:from:to:cc:subject:date:message-id:reply-to; bh=WJ6asCQqexD/yt6kdRolGmxiy9o0EUGFCptl2FiozGc=; b=DY2EfmkSSNYInzUzg3i0uXouoc5Av3qt3QxatBD5sb18CvEFf87RTYdpqtyq5sVpGT GpUDXamBQ4L1ydpiRBBtJ0p+AL/lrhzEnActa8qcM34Q/57rf/tXeOTI3xHFRx5cIkB7 Pw3wm1GUlBjVXkF+30PcgIGsSXJy6oGxrSgo9dH6VxAYq5Lj5Ee15IMxXfwVZrY3drNB Slz9v0mxxnKWc4W5qqBy83C4SbXc3xTMphfZj2pZG1DBK2fo1yc3rQVz19QjwsmAi8VV S6Onw+qAVinAzB5QqC10QoV5xC3IQ8sMEfKa9M/2wV8K83ndJ2B1Q6nXlI8a/igdLW/d Ed8g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738872518; x=1739477318; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=WJ6asCQqexD/yt6kdRolGmxiy9o0EUGFCptl2FiozGc=; b=Yum2E+VNShGldTlGLz5WAX70na9PA/RZFCKQqQhxP+h2UJaEsui21BFrD9dkc74Y3d 5DlIAsvGb/75g1YKLK+eDoqCLWUERreNiqPfohgG5nQVPY2MdA3by08TJNk5xIInI6E8 1FzWYwNxE77rv84czBcRe6oiD32Ndkn7ReHYD1IgduM9/ws6PU3OlisxiVB+/VrcuwAx /qHn9RZMy8kdZQMglBqBscR/btDdmmfTXKBIFHThUBq4b948AM1oHiXLMejqZ+nFeMtW RebBnBb38GO79xCYdliI70+fpEZHMS5a/gBoq/xK1rlfEzNO0KZQWBlC3LDBV2KWlV4E ADaQ== X-Gm-Message-State: AOJu0Yzixx/vLyCvAZCcdapdNC9Io4sUDyFzItfHuloTBGBTQuHbWXpI Y9PdfKogN34krR+1Agmy8V00i6rhHOPCUU0zvqHlZngxen/QyIynpA7xreFPj/I= X-Gm-Gg: ASbGnctNAexh0/JDEBaS5mU8nS13wKJF3Ve1T4dXZ4bLBL29p9lSsJ8OWOh/zyoh6cn jgQO2r3rGNxDLNyr40rSczThQyPisR/xJNHNlSXT/0BVh8zHfVEiqCOr8f9D+yhbU9xbQPS29Gk QV/Eoj57xP90aYXPcSuSxCxmTMZB6MEsOZc39g6snNq6c26ZPXWya+b40FsTdaAZUW25IIS2US8 NE218XeZcfNkcfE4grFj1f7HmY1ftfM4ZSCOQiGHjh8PZ0WXKdB4HouBpzRAt/QS2zAyHcqQ71P 0L7fjHZAcL4/0PD+hMsvNp2GyX4GRrwJ6Vzb X-Google-Smtp-Source: AGHT+IFG923b5ZxlqCJgguGWqIKxKI3TiO4syMAcyfAO8sB47QoR7caix0G2FP2TkgIwmlL1HQfbYg== X-Received: by 2002:a05:622a:1114:b0:46e:23ad:a120 with SMTP id d75a77b69052e-47167bc5b16mr10050531cf.35.1738872517995; Thu, 06 Feb 2025 12:08:37 -0800 (PST) Received: from smtpclient.apple ([165.1.205.67]) by smtp.gmail.com with ESMTPSA id d75a77b69052e-47149275f84sm8515151cf.4.2025.02.06.12.08.37 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 06 Feb 2025 12:08:37 -0800 (PST) From: Robert Leach Message-Id: <00DF52D1-5ED0-4599-8480-8C671434EE4E@princeton.edu> Content-Type: multipart/alternative; boundary="Apple-Mail=_82E64062-A7A9-432F-B389-649CAA62999F" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.500.171.1.1\)) Subject: Re: How to perform a long running dry run transaction without blocking Date: Thu, 6 Feb 2025 15:08:26 -0500 In-Reply-To: Cc: pgsql-general To: Adrian Klaver References: <88d60ace-45e6-4d41-afc4-113df7219c4d@aklaver.com> <4000D0EE-B250-4E9E-831F-00C034D6D0B5@princeton.edu> <6d833658-f461-4ad4-a3e1-86d3c515bc18@aklaver.com> <0FE9C709-A108-4ED5-8132-B802B8D9908F@princeton.edu> X-Mailer: Apple Mail (2.3774.500.171.1.1) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_82E64062-A7A9-432F-B389-649CAA62999F Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii >>> The load to the development server does no validation? >>>=20 >>> If so what is the purpose? >>>=20 >>> The background processes are other validation runs? >> It's the same code that executes in both cases (with or without the = `--validate` flag). All that that flag does is it (effectively) raises = the dry run exception before it leaves the transaction block, so it = always validates (whether the flag is supplied or not). >=20 > More for my sake then anything else, why do the load to the = development server at all if the production load is the only one that = counts? The software is still in a new major version beta. We're adding = features and fixing bugs. It's not unusual to encounter a new bug, fix = it on dev to get the load to work, then deploy a point release on prod. = And that means repeated load attempts that interfere with the validation = interface. Besides, beyond this, we're planning on a separate staging = database that dev effectively now is. Sometimes, a curator only finds a = technical data issue after the initial load while browsing the newly = loaded data on the dev site. >> So the load doesn't fail until the end of the run, which is = inefficient from a maintenance perspective. I've been thinking of = adding a `--failfast` option for use on the back end. Haven't done it = yet. I started a load yesterday in fact that ran 2 hours before it = buffered an exception related to a newly introduced bug. I fixed the = bug and ran the load again. It finished sometime between COB yesterday = and this morning (successfully!). >=20 > Alright I am trying to reconcile this with from below, 'The largest = studies take just under a minute'. The context of the 'The largest studies take just under a minute' = statement is that it's not loading the hefty/time-consuming raw data. = It's only validating the metadata. That's fast (5-60s). And that data = is a portion of the transaction in the back-end load. There are errors = that validation can miss that are due to not touching the raw data, and = in fact, those errors are addressed by curators editing the excel = sheets. That's why it's all in the load transaction instead of loaded = separately, but those problems are somewhat rare (and we currently have = a new feature in the design phase that should almost completely = eliminate those issues). >>> Seems you are looking for some sort of queuing system. >>>=20 >>> What are the time constraints for getting the validation turned = around. >> I have considered a queuing system, though when I previously floated = a proof of concept using celery, I was informed it was too much. = Though, at the time, all I was trying to do was a progress bar for a = query stats feature. So proposing celery in this instance may get more = traction with the rest of the team. >> Most of the small validation processes finish in under a dozen = seconds. The largest studies take just under a minute. I have plans = to optimize the loading scripts that hopefully could get the largest = studies down to a dozen seconds. If I could do that, and do the back = end loads in off-peak hours, then I'd be willing to suffer the rare = timeouts from concurrent validations. The raw data loads will still = likely take a much longer time. >=20 > This is where I get confused, probably because I am not exactly sure = what constitutes validation. My sense is that involves a load of data = into live tables and seeing what fails PK, FK or other constraints. >=20 > If that is the case I am not seeing how the 'for real' data load would = be longer? The validation skips the time-consuming raw data load. That raw data is = collectively hundreds of gigs in size and could not be uploaded on the = validation page anyway. The feature I alluded to above that would make = errors associated with the raw data almost completely eliminated is one = where the researcher can drop the raw data folder into the form and it = just walks the directory to get all the raw data file names and relative = paths. It's those data relationships whose validations are currently = skipped. > At any rate I can't see how loading into a live database multiple sets = of data while operations are going on in the database can be made = conflict free. To me it seems the best that be done is: >=20 > 1) Reduce chance for conflict by spreading the actions out. >=20 > 2) Have retry logic that deals with conflicts. I'm unfamiliar with retry functionality, but those options sound logical = to me as a good path forward, particularly using celery to spread out = validations and doing the back end loads at night (or using some sort of = fast dump/load). The thing that bothers me about the celery solution is = that most of the time, 2 users validating different data will not block, = so I would be making users wait for no reason. Ideally, I could = anticipate the block and only at that point, separate those validations. This brings up a question though about a possibility I suspect is not = practical. My initial read of the isolation levels documentation found = this section really promising: > The Repeatable Read isolation level only sees data committed before = the transaction began; it never sees either uncommitted data or changes = committed during transaction execution by concurrent transactions. This was before I realized that the actions of the previously started = transaction would include "locks" that would block validation even = though the load transaction hasn't committed yet: > a target row might have already been updated (or deleted or locked) by = another concurrent transaction by the time it is found. In this case, = the repeatable read transaction will wait for the first updating = transaction to commit or roll back Other documentation I read referred to the state of the DB (when a = transaction starts) as a "snapshot" and I thought... what if I could = save such a snapshot automatically just before a back-end load starts, = and use that snapshot for validation, such that my validation processes = could use that to validate against and not encounter any locks? The = validation will never commit, so there's no risk. I know Django's ORM wouldn't support that, but I kind of hoped that = someone in this email list might suggest a snapshot functionality as a = possible solution. Since the validations never commit, the only = downside would be if the backend load changed something that introduces = a problem with the validated data that would not be fixed until we = actually attempt to load it. Is that too science-fictiony of an idea?= --Apple-Mail=_82E64062-A7A9-432F-B389-649CAA62999F Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=us-ascii
The load to the development = server does no validation?

If so what is the purpose?

The = background processes are other validation runs?
It's the = same code that executes in both cases (with or without the `--validate` = flag).  All that that flag does is it (effectively) raises the dry = run exception before it leaves the transaction block, so it always = validates (whether the flag is supplied or = not).

More for my sake then anything else, why do = the load to the development server at all if the production load is the = only one that = counts?

The software is = still in a new major version beta.  We're adding features and = fixing bugs.  It's not unusual to encounter a new bug, fix it on = dev to get the load to work, then deploy a point release on prod. =  And that means repeated load attempts that interfere with the = validation interface.  Besides, beyond this, we're planning on a = separate staging database that dev effectively now is.  Sometimes, = a curator only finds a technical data issue after the initial load while = browsing the newly loaded data on the dev site.

So the load doesn't = fail until the end of the run, which is inefficient from a maintenance = perspective.  I've been thinking of adding a `--failfast` option = for use on the back end.  Haven't done it yet.  I started a = load yesterday in fact that ran 2 hours before it buffered an exception = related to a newly introduced bug.  I fixed the bug and ran the = load again.  It finished sometime between COB yesterday and this = morning (successfully!).

Alright I am trying to = reconcile this with from below, 'The largest studies take just under a = minute'.

The context of = the 'The largest studies take just under a minute' statement is that = it's not loading the hefty/time-consuming raw data.  It's only = validating the metadata.  That's fast (5-60s).  And that data = is a portion of the transaction in the back-end load.  There are = errors that validation can miss that are due to not touching the raw = data, and in fact, those errors are addressed by curators editing the = excel sheets.  That's why it's all in the load transaction instead = of loaded separately, but those problems are somewhat rare (and we = currently have a new feature in the design phase that should almost = completely eliminate those issues).

Seems you are looking for some sort of queuing = system.

What are the time constraints for getting the validation = turned around.
I have considered a queuing system, = though when I previously floated a proof of concept using celery, I was = informed it was too much.  Though, at the time, all I was trying to = do was a progress bar for a query stats feature.  So proposing = celery in this instance may get more traction with the rest of the = team.
Most of the small validation processes finish in under a dozen = seconds.   The largest studies take just under a minute. =  I have plans to optimize the loading scripts that hopefully could = get the largest studies down to a dozen seconds.  If I could do = that, and do the back end loads in off-peak hours, then I'd be willing = to suffer the rare timeouts from concurrent validations.  The raw = data loads will still likely take a much longer = time.

This is where I get confused, probably because = I am not exactly sure what constitutes validation. My sense is that = involves a load of data into live tables and seeing what fails PK, FK or = other constraints.

If that is the case I am not seeing how the = 'for real' data load would be = longer?

The validation = skips the time-consuming raw data load.  That raw data is = collectively hundreds of gigs in size and could not be uploaded on the = validation page anyway.  The feature I alluded to above that would = make errors associated with the raw data almost completely eliminated is = one where the researcher can drop the raw data folder into the form and = it just walks the directory to get all the raw data file names and = relative paths.  It's those data relationships whose validations = are currently skipped.

At = any rate I can't see how loading into a live database multiple sets of = data while operations are going on in the database can be made conflict = free. To me  it seems the best that be done is:

1) Reduce = chance for conflict by spreading the actions out.

2) Have retry = logic that deals with = conflicts.

I'm unfamiliar = with retry functionality, but those options sound logical to me as a = good path forward, particularly using celery to spread out validations = and doing the back end loads at night (or using some sort of fast = dump/load).  The thing that bothers me about the celery solution is = that most of the time, 2 users validating different data will not block, = so I would be making users wait for no reason.  Ideally, I could = anticipate the block and only at that point, separate those = validations.

This brings up a question though = about a possibility I suspect is not practical.  My initial read of = the isolation levels documentation found this section really = promising:

> The Repeatable Read = isolation level only sees data committed before the transaction began; = it never sees either uncommitted data or changes committed during = transaction execution by concurrent = transactions.

This was before I realized that = the actions of the previously started transaction would include "locks" = that would block validation even though the load transaction hasn't = committed yet:

> a target row might = have already been updated (or deleted or locked) by another = concurrent transaction by the time it is found. In this case, the = repeatable read transaction will wait for the first updating transaction = to commit or roll back

Other documentation I = read referred to the state of the DB (when a transaction starts) as a = "snapshot" and I thought... what if I could save such a snapshot = automatically just before a back-end load starts, and use = that snapshot for validation, such that my validation processes could = use that to validate against and not encounter any locks?  The = validation will never commit, so there's no = risk.

I know Django's ORM wouldn't support = that, but I kind of hoped that someone in this email list might suggest = a snapshot functionality as a possible solution.  Since the = validations never commit, the only downside would be if the backend load = changed something that introduces a problem with the validated data that = would not be fixed until we actually attempt to load = it.

Is that too science-fictiony of an = idea?
= --Apple-Mail=_82E64062-A7A9-432F-B389-649CAA62999F--