Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tg68w-007hrO-8b for pgsql-general@arkaria.postgresql.org; Thu, 06 Feb 2025 17:58:26 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tg68v-009wNL-Bv for pgsql-general@arkaria.postgresql.org; Thu, 06 Feb 2025 17:58:25 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tg68u-009wN4-Pa for pgsql-general@lists.postgresql.org; Thu, 06 Feb 2025 17:58:25 +0000 Received: from mail-qk1-x733.google.com ([2607:f8b0:4864:20::733]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tg68s-003Za1-01 for pgsql-general@postgresql.org; Thu, 06 Feb 2025 17:58:23 +0000 Received: by mail-qk1-x733.google.com with SMTP id af79cd13be357-7b6c3629816so60638785a.1 for ; Thu, 06 Feb 2025 09:58:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=princeton.edu; s=google; t=1738864701; x=1739469501; darn=postgresql.org; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:from:to:cc:subject:date:message-id:reply-to; bh=NI9ZM6yPss9pCl67FxuZpNHiwWEwPv3aQdFXPWT70FY=; b=DoFckyHChHrrcYDVXWtaL3GtF2/tgKL0bqbAN/eABXCb52tnOzmbGC4dgy9TJCaPkn xuulh1KmjPF3uzLwhzroSZiQaTlJ1esxS3Eyjf0AZxJPR3vROEVFtZ/8TWoFW/PXKG5j dVpiQHt1mytCKQ0reZ2HJx6zr3gN63cDXA+AkbeKr7WIG4ayopCGBIqCchf/Wlv+tzF1 t1z3SJ6xxWICiLYmNG4ZLYf9TwaHsOSlqRIqghIdPZP7EApekjAB80AcNXDlKZbAa112 00TqcP5aVgkQGnBPaWxjl7EhthaNPeusY0T7aqIAKqpWZN+4bF0Le3UL3cZkLi/B/Tw0 X/5A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738864701; x=1739469501; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=NI9ZM6yPss9pCl67FxuZpNHiwWEwPv3aQdFXPWT70FY=; b=MUQa3JPrfTG+b/Z1/Q/qme5ScJsoeKI8f1gBZQzBI8zyPzyxt7/qO/jMgSprt+uroQ CkzQ8830wxslXfVtAX5u2wYGpL2eUUhe2dV7JKiMsXNaOt5QuJySQmy7p0iQXkSk9hGU Lw6ND1zAbfL3GxfgmQawGJFbXZcx/bSEcwzIlokbC5gjH5lCs6A0HFjblRY3AqQgyrau O0RMODcv6aMpNg0SvrlI7O6/MFBoXZsubHhpG7lZvPJXKwHyBZr87P5Zydh0dc+DcleA GCtwBBZt/QqwDFhQyZj0yEPBwTOMEi9bTmZQjKjW3GyVj+XOb5VlNAH4eFN87FATqqzH 2hGQ== X-Gm-Message-State: AOJu0YzDutj/kLhXJtaOB64I5Gw1ZGEWq+AO40YNNVCQdVQB6ZXCz6Hd haWwR2PqBTtFLK3jopkbAEjpGKOQc0TCCuhz5dHdgyrMZIc3e3GckYoegiye6CM= X-Gm-Gg: ASbGncu8jUHMbUCT5KXrTlRv95ipGIowBzuWVtQpZeMs3GqefuXpjijaH+RidEC7OJ1 UEZzkjfo0M9xTMKz/nM2aQCQdBh3VM0lNU39WEP8ubUUFfnIIz6x4cSeNEIm/NA4wx3vzGkSigo Nnb+z8Klb9aPtZH7LECbaAif8ndZm6fhHpStsl5BRLXshGvoxYg3rIYuCWq1BwQ27QkbOc98S+P +dR5le+90sqndEYQ7n+dEDTn/7SS5b8u0ra5WTRfoYiSjsMBWzgornw8YahDbVCEGPIipePnl7H R1r7Fum6IjESQwlk0BYNsQMv3N0l/Y0= X-Google-Smtp-Source: AGHT+IH07u9du++/lTTu5/IlFO7ihifXgKnN2kYg93t4nZf70IZY3mewk/JYfioL4CIhRKKu4s/S8w== X-Received: by 2002:a05:620a:44d6:b0:7be:98a9:298 with SMTP id af79cd13be357-7c047bbc677mr11901085a.19.1738864701260; Thu, 06 Feb 2025 09:58:21 -0800 (PST) Received: from smtpclient.apple ([165.1.205.67]) by smtp.gmail.com with ESMTPSA id af79cd13be357-7c041dfb1efsm85354585a.32.2025.02.06.09.58.19 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 06 Feb 2025 09:58:20 -0800 (PST) From: Robert Leach Message-Id: <0FE9C709-A108-4ED5-8132-B802B8D9908F@princeton.edu> Content-Type: multipart/alternative; boundary="Apple-Mail=_ABF3AA76-48F4-416D-ABFA-86E39773A17A" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.500.171.1.1\)) Subject: Re: How to perform a long running dry run transaction without blocking Date: Thu, 6 Feb 2025 12:58:08 -0500 In-Reply-To: <6d833658-f461-4ad4-a3e1-86d3c515bc18@aklaver.com> Cc: pgsql-general To: Adrian Klaver References: <88d60ace-45e6-4d41-afc4-113df7219c4d@aklaver.com> <4000D0EE-B250-4E9E-831F-00C034D6D0B5@princeton.edu> <6d833658-f461-4ad4-a3e1-86d3c515bc18@aklaver.com> X-Mailer: Apple Mail (2.3774.500.171.1.1) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_ABF3AA76-48F4-416D-ABFA-86E39773A17A Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > Please reply to list also. > Ccing list =F0=9F=91=8D=F0=9F=8F=BB >> They enter the data in an excel spreadsheet containing about a dozen = inter-related worksheets, named, for example: Study, Animals, Samples, = Tissues, Treatments, Sequences >=20 > Long term is there a thought to have them enter directly into database = where validation can happen in real time No, not really. Sample metadata tends to be an afterthought to = researchers. They have it in their notebooks and getting them to enter = it at all is like pulling teeth. The validation interface actually has = a bunch of other features I haven't mentioned that streamline the = process for them. Before it gets to actually validating the data, it = tries to lighten the manual burden on the researchers (and help with = consistent nomenclature) by pulling sample names out of the raw files, = massaging them, and filling those in along with a mass of common data = that is used to populate drop-downs in the excel columns to avoid = researcher typos and value variants. Having everything work with excel actually made the site more attractive = to the researchers, because they're comfortable with it and use it = already, so it lowered the bar for using our software. Besides, we don't trust the users enough to enter data unsupervised. = There are a lot of aspects of the data that cannot be automatically = validated and involve experimental parameters that are adjacent to the = purpose of our site. We have curators that need to look at everything = to ensure consistency, and looking at all the data in context is = necessary before any of it is entered. That said, back in the aughts, I wrote a perl cgi site for a toxin and = virulence factor database that used a web interface for data entry and = achieved the curation goal by saving a form of all inter-related data. = The submit button sent that form to a list of curators who could approve = the insert/update and make it actually happen. I think I had actually = suggested that form of data entry when this current project first = started, but I was overruled. However, in this project, the equivalent = procedure would be per-sample, and you'd lose out on the overall = context. It's an interesting challenge, but I think we're pretty = committed now on this file load path. >>> Where are the background processes loading data to? >> We first run our loads on a development server with a separate copy = of the database, but then we re-run those same loads on the production = server, where users perform their validations. >=20 > The load to the development server does no validation? >=20 > If so what is the purpose? >=20 > The background processes are other validation runs? It's the same code that executes in both cases (with or without the = `--validate` flag). All that that flag does is it (effectively) raises = the dry run exception before it leaves the transaction block, so it = always validates (whether the flag is supplied or not). So the load doesn't fail until the end of the run, which is inefficient = from a maintenance perspective. I've been thinking of adding a = `--failfast` option for use on the back end. Haven't done it yet. I = started a load yesterday in fact that ran 2 hours before it buffered an = exception related to a newly introduced bug. I fixed the bug and ran = the load again. It finished sometime between COB yesterday and this = morning (successfully!). >> One of the thoughts I'd had to work around the problem was to somehow = dump the data from the development load and load it onto production in = some sort of scheduled downtime or something. However, even if we do = that, I'm concerned that multiple users concurrently validating = different submissions would encounter this blocking issue, and since = those validations can take (on the upper end) just under a minute, it's = enough for at least 1 user to encounter a timeout. I have not yet = proven that can happen, but based on my reading of the postgres = documentation, it seems logical. >=20 > Seems you are looking for some sort of queuing system. >=20 > What are the time constraints for getting the validation turned = around. I have considered a queuing system, though when I previously floated a = proof of concept using celery, I was informed it was too much. Though, = at the time, all I was trying to do was a progress bar for a query stats = feature. So proposing celery in this instance may get more traction = with the rest of the team. Most of the small validation processes finish in under a dozen seconds. = The largest studies take just under a minute. I have plans to optimize = the loading scripts that hopefully could get the largest studies down to = a dozen seconds. If I could do that, and do the back end loads in = off-peak hours, then I'd be willing to suffer the rare timeouts from = concurrent validations. The raw data loads will still likely take a = much longer time. Robert William Leach Research Software Engineer 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544 --Apple-Mail=_ABF3AA76-48F4-416D-ABFA-86E39773A17A Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8
Please = reply to list also.
Ccing = list

=F0=9F=91=8D=F0=9F=8F=BB
<= /div>
They enter the data in an excel spreadsheet containing = about a dozen inter-related worksheets, named, for example: Study, = Animals, Samples, Tissues, Treatments, Sequences =

Long term is there a thought to have them enter = directly into database where validation can happen in real = time

No, not really. =  Sample metadata tends to be an afterthought to researchers. =  They have it in their notebooks and getting them to enter it at = all is like pulling teeth.  The validation interface actually has a = bunch of other features I haven't mentioned that streamline the process = for them.  Before it gets to actually validating the data, it tries = to lighten the manual burden on the researchers (and help with = consistent nomenclature) by pulling sample names out of the raw files, = massaging them, and filling those in along with a mass of common data = that is used to populate drop-downs in the excel columns to avoid = researcher typos and value variants.

Having = everything work with excel actually made the site more attractive to the = researchers, because they're comfortable with it and use it already, so = it lowered the bar for using our = software.

Besides, we don't trust the users = enough to enter data unsupervised.  There are a lot of aspects of = the data that cannot be automatically validated and involve experimental = parameters that are adjacent to the purpose of our site.  We have = curators that need to look at everything to ensure consistency, and = looking at all the data in context is necessary before any of it is = entered.

That said, back in the aughts, I wrote = a perl cgi site for a toxin and virulence factor database that used a = web interface for data entry and achieved the curation goal by saving a = form of all inter-related data.  The submit button sent that form = to a list of curators who could approve the insert/update and make it = actually happen.  I think I had actually suggested that form of = data entry when this current project first started, but I was overruled. =  However, in this project, the equivalent procedure would be = per-sample, and you'd lose out on the overall context.  It's an = interesting challenge, but I think we're pretty committed now on this = file load path.

Where are the background = processes loading data to?
We first run our loads on a = development server with a separate copy of the database, but then we = re-run those same loads on the production server, where users perform = their validations.

The load to the development = server does no validation?

If so what is the purpose?

The = background processes are other validation = runs?

It's the same code = that executes in both cases (with or without the `--validate` flag). =  All that that flag does is it (effectively) raises the dry run = exception before it leaves the transaction block, so it always validates = (whether the flag is supplied or not).

So the = load doesn't fail until the end of the run, which is inefficient from a = maintenance perspective.  I've been thinking of adding a = `--failfast` option for use on the back end.  Haven't done it yet. =  I started a load yesterday in fact that ran 2 hours before it = buffered an exception related to a newly introduced bug.  I fixed = the bug and ran the load again.  It finished sometime between COB = yesterday and this morning (successfully!).

One of the thoughts = I'd had to work around the problem was to somehow dump the data from the = development load and load it onto production in some sort of scheduled = downtime or something.  However, even if we do that, I'm concerned = that multiple users concurrently validating different submissions would = encounter this blocking issue, and since those validations can take (on = the upper end) just under a minute, it's enough for at least 1 user to = encounter a timeout.  I have not yet proven that can happen, but = based on my reading of the postgres documentation, it seems = logical.

Seems you are looking for some sort of = queuing system.

What are the time constraints for getting the = validation turned = around.

I have = considered a queuing system, though when I previously floated a proof of = concept using celery, I was informed it was too much.  Though, at = the time, all I was trying to do was a progress bar for a query stats = feature.  So proposing celery in this instance may get more = traction with the rest of the team.

Most of the = small validation processes finish in under a dozen seconds.  The = largest studies take just under a minute.  I have plans to optimize = the loading scripts that hopefully could get the largest studies down to = a dozen seconds.  If I could do that, and do the back end loads in = off-peak hours, then I'd be willing to suffer the rare timeouts from = concurrent validations.  The raw data loads will still likely take = a much longer time.


Robert William = Leach
Research Software Engineer
133 Carl C. Icahn = Lab
Lewis-Sigler Institute for Integrative = Genomics
Princeton University
Princeton, NJ = 08544

= --Apple-Mail=_ABF3AA76-48F4-416D-ABFA-86E39773A17A--