Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.80) (envelope-from ) id 1Wsi5N-0007sw-7s for pgsql-docs@arkaria.postgresql.org; Fri, 06 Jun 2014 00:32:45 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.80) (envelope-from ) id 1Wsi5M-00034u-JL for pgsql-docs@arkaria.postgresql.org; Fri, 06 Jun 2014 00:32:44 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from ) id 1Wsi5L-00034c-R7 for pgsql-docs@postgresql.org; Fri, 06 Jun 2014 00:32:44 +0000 Received: from mail-we0-x22a.google.com ([2a00:1450:400c:c03::22a]) by magus.postgresql.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.80) (envelope-from ) id 1Wsi5I-0005Kq-01 for pgsql-docs@postgresql.org; Fri, 06 Jun 2014 00:32:42 +0000 Received: by mail-we0-f170.google.com with SMTP id u57so2052045wes.29 for ; Thu, 05 Jun 2014 17:32:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=vXhgnC8roRFSmrD6oUU76El+9ffIiOoM0RSox2SmvF0=; b=GdXVhS40xmvbDoJoDR0QTORA5xCBbCV6O6LAa5+uBIBUM0PNDCT4QQ2vi3iAWaqh0H qktFGgSjzxOlzZLgIe/1Q3kZ3782f+jjOrKiu56qlupK9aUqMDLn8+WnrUfMw2cgjFGJ 1eJ0UQqJ+bzpZQQlBd/08IrasjYsxTARfI+x9Porp+QSZP8qxve8pfc37mG57tcu5KPV fiB7VJ88Lcg24SEbR/9zfCrR1t3Ial1xIUonI9lMur0z8E5bRa73bjIsjAptuOSVTGlX 8COZnFTc6/MpEyEI5cCyyHkbDqm4olW0wwY+zwRgBrpk9dth9OXCsAKCrInRVV6M2ajr Xb7w== MIME-Version: 1.0 X-Received: by 10.194.48.38 with SMTP id i6mr1639879wjn.46.1402014758728; Thu, 05 Jun 2014 17:32:38 -0700 (PDT) Received: by 10.217.1.145 with HTTP; Thu, 5 Jun 2014 17:32:38 -0700 (PDT) In-Reply-To: <28673.1402012845@sss.pgh.pa.us> References: <1402011281941-5806271.post@n5.nabble.com> <28673.1402012845@sss.pgh.pa.us> Date: Thu, 5 Jun 2014 20:32:38 -0400 Message-ID: Subject: Re: Some qualms with the current description of RegExp s,n,w modes. From: David Johnston To: Tom Lane Cc: pgsql-docs@postgresql.org Content-Type: multipart/alternative; boundary=047d7bacbdaaa9f86404fb1fffe8 X-Pg-Spam-Score: -2.0 (--) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-docs Precedence: bulk Sender: pgsql-docs-owner@postgresql.org --047d7bacbdaaa9f86404fb1fffe8 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Thu, Jun 5, 2014 at 8:00 PM, Tom Lane wrote: > David G Johnston writes: > > I simplified ". and bracket expressions" to "wildcard" and "^ and $" to > > "anchors" though did make use of ^ and $individual quite a bit. I did > not > > formally define these terms in the body either. > > Did you mean to attach a proposed doc patch here, or are you just > armwaving about what a patch might look like? > =E2=80=8BArmwaving for lack of any current setup to generate doc-patches.= =E2=80=8B > FWIW, I don't agree with using "wildcard" to mean those particular things > (the term is too generic, and there are other regex constructs that > might be thought to be included); although you could probably get away > with using "anchor" this way as long as you define the term at first use. > > =E2=80=8BI had the same nagging suspicion but figured for a first pass, and= defined only within this context, it would suffice. ". and ^ brackets" just rubbed me the wrong way but it does have the merit of being precise.=E2=80=8B > The text involved here is more or less verbatim from Henry Spencer's > original man page for the regex library, so you're essentially claiming > you know more than the author did about what his code is good for. Maybe > so, but some examples in support of your thesis would be a good thing. > =E2=80=8BI can readily support why I found [w] to be most useful; the concl= usion that [w] > [s] came from the logic that making "^ and $" useless means that using [w] mode and simply avoiding using them would have the same effect. I'll admit that people using ^ and $ where they really meant \A and \Z may be an issue worth accounting for...but I personally call providing that mode to be a compatibility/help-oriented decision and just decided to state so in my revision. Example that prompted this whole journey: WITH src (filecontent) AS ( VALUES( $$CDF CORR: DRAIN COOLANT AND REFILL ADDITIONAL DLR-OP: BGFLDEX PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG TO: EPA CHG: HAZ CHG: 9999 5 SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#: ENGINE: CDR CORR: CUSTOMER ELECTED NOT TO HAVE REPAIRS DONE AT THIS TIME NOS PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG TO: EPA CHG: HAZ CHG: 9999 03 0030 SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#: ENGINE: $$::varchar )) , do_match AS ( SELECT regexp_matches(filecontent,'^(\S.*?)(?=3D^\S|\Z)','gw') AS match FRO= M src ) , explode_match AS ( SELECT unnest(match) FROM do_match ) SELECT unnest, length(unnest) FROM explode_match; [s] 1 result because the "^\S" construct attempts to match beginning-of-document instead of beginning-of-line. This is when I started digging deeper since I expected it to behave like [w]. [n] 0 results because the (.*?) never gets beyond the first line and thus cannot match "^\S|\Z" - no problem here, the behavior of "." is as expected= . [w] 2 results as desired/expected. It is possible to replace ^\S with \n\S (and thus allow [s] to work) but the semantic meaning of ^ makes using this form more convenient Note that CDF has 5 rows of content while CDR only has 4; thus strongly suggesting the use of newline-insensitive "wildcard" matching. The choice of anchor mode is of a cosmetic/semantic nature but I argue that in this situation the semantic of [w] are preferred over [n]. In either case I'd rather simply drop the existing commentary that [w] is not that useful and either in words or example explain when it would have use; even if you do not want to go as far as to claim that [w] is superior to [n] as I would. While it is likely possible to write a working expression in all three modes my experience - which is largely based in executing these expressions in Java, not PostgreSQL thought that is becoming more common nowadays - led me directly to the regexp provided. > > Instead of calling these "partial" and "inverse partial" better terms > would > > be "newline-sensitive wildcard matching" and "newline-sensitive anchor > > matching". > > Agreed that "partial" is not a very good name, but I remain resistant to > "wildcard" here. > > > The default mode could be called "newline-sensitive full > > matching". > > Or just "newline-sensitive matching" ... does "full" add anything? > > =E2=80=8BNot much - though after adding "anchor" and "wildcard" to the othe= rs the question became if this option is not only one of those then is it both, or neither? Full makes it clear that it means both. Maybe something like: [s] - single-line mode; [w] - multi-line mode; [n|m] - document-only mode; though I dislike re-associating multi-line with [w] given its current association with [n|m]. "Record Mode [w]" has some merit since that is at least the use case that I have identified where it is particularly useful... David J. =E2=80=8B --047d7bacbdaaa9f86404fb1fffe8 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
On Thu, Jun 5, 2014 a= t 8:00 PM, Tom Lane &l= t;tgl@sss.pgh.pa.us<= /a>> wrote:
David G Johnston <david.= g.johnston@gmail.com> writes:
> I simplified ". and bracket expressions" to "wildcard&q= uot; and "^ and $" to
> "anchors" though did make use of ^ and $individual quite a b= it. =C2=A0I did not
> formally define these terms in the body either.

Did you mean to attach a proposed doc patch here, or are you just
armwaving about what a patch might look like?

=E2=80=8BArmwaving for lack of any current setup to generate doc-= patches.=E2=80=8B


FWIW, I don't agree with using "wildcard" to mean those parti= cular things
(the term is too generic, and there are other regex constructs that
might be thought to be included); although you could probably get away
with using "anchor" this way as long as you define the term at fi= rst use.


=E2=80=8BI had the same nagging sus= picion but figured for a first pass, and defined only within this context, = it would suffice. =C2=A0". and ^ brackets" just rubbed me the wro= ng way but it does have the merit of being precise.=E2=80=8B

=C2=A0
The text involved here is more or less verbatim from Henry Spencer's original man page for the regex library, so you're essentially claiming=
you know more than the author did about what his code is good for. =C2=A0Ma= ybe
so, but some examples in support of your thesis would be a good thing.
<= /blockquote>

=E2=80=8BI can readily support why I fou= nd [w] to be most useful; the conclusion that [w] > [s] came from the lo= gic that making "^ and $" useless means that using [w] mode and s= imply avoiding using them would have the same effect. =C2=A0I'll admit = that people using ^ and $ where they really meant \A and \Z may be an issue= worth accounting for...but I personally call providing that mode to be a c= ompatibility/help-oriented decision and just decided to state so in my revi= sion.

Example that prompted this whole journey:

WITH src (filecontent) AS = ( VALUES(
$$CDF =C2=A0 =C2=A0 =C2=A0CORR: DRAIN COOLANT AND= REFILL
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ADDITIONAL DLR-OP: BGFLDEX
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0PAY TYPE: C =C2=A0 =C2=A0 =C2=A0 = =C2=A0 OTH HRS: 0000 =C2=A0 =C2=A0FORECAST SERVICE: =C2=A0 =C2=A0 =C2=A0CHG= TO: =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 EPA CHG: =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 HAZ CHG:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A09999 =C2=A0 =C2=A0 5 =C2=A0 =C2=A0=C2= =A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0SPG CONVERSION SETTIN= GS - SPG MFG: -- =C2=A0GEN MOD: -- =C2=A0VIN/MODEL#: =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 ENGINE:


CDR =C2=A0 =C2=A0 =C2=A0CORR:= CUSTOMER ELECTED NOT TO HAVE REPAIRS DONE AT THIS TIME =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0NOS
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0PAY TYPE: C =C2=A0 =C2=A0 =C2=A0 =C2= =A0 OTH HRS: 0000 =C2=A0 =C2=A0FORECAST SERVICE: =C2=A0 =C2=A0 =C2=A0CHG TO= : =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 EPA CHG: =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 HAZ CHG:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A09999 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 03 0030
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0SPG CONVERSION SETTINGS - SPG MFG: -- =C2=A0GEN MOD: -- =C2=A0VIN/MOD= EL#: =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ENGINE:
$$::varchar
))
, do_match AS (
SELECT regexp_matches(filecontent,'^(\S.*?)(?=3D^\S|\Z)','gw&#= 39;) AS match FROM src
)
, explode_match AS (
SELECT unnest(match) FROM do_ma= tch
)
SELECT unnest, length(unnest) FROM ex= plode_match;

[s] 1 result because the "^\S" construct= attempts to match beginning-of-document instead of beginning-of-line. =C2= =A0This is when I started digging deeper since I expected it to behave like= [w].
[n] 0 results because the (.*?) never gets beyond the first line and th= us cannot match "^\S|\Z" - no problem here, the behavior of "= ;." is as expected.
[w] 2 results as desired/expected. =C2=A0It is possible to replace ^\S = with \n\S (and thus allow [s] to work) but the semantic meaning of ^ makes = using this form more convenient

Note that CDF has 5 rows of con= tent while CDR only has 4; thus strongly suggesting the use of newline-inse= nsitive "wildcard" matching. =C2=A0The choice of anchor mode is o= f a cosmetic/semantic nature but I argue that in this situation the semanti= c of [w] are preferred over [n].

In either case I'd rather s= imply drop the existing commentary that [w] is not that useful and either i= n words or example explain when it would have use; even if you do not want = to go as far as to claim that [w] is superior to [n] as I would.

While it is likely possible to = write a working expression in all three modes my experience - which is larg= ely based in executing these expressions in Java, not PostgreSQL thought th= at is becoming more common nowadays - led me directly to the regexp provide= d.


> Instead of calling these "partial" and "inverse partial= " better terms would
> be "newline-sensitive wildcard matching" and "newline-s= ensitive anchor
> matching".

Agreed that "partial" is not a very good name, but I remain resis= tant to
"wildcard" here.

> The default mode could be called "newline-sensitive full
> matching".

Or just "newline-sensitive matching" ... does "full" ad= d anything?


=E2=80=8BNot much - though after adding "anchor&q= uot; and "wildcard" to the others the question became if this opt= ion is not only one of those then is it both, or neither? =C2=A0Full makes = it clear that it means both.

Maybe something like: [s] - single-line mode; [w] - multi-l= ine mode; [n|m] - document-only mode; though I dislike re-associating multi= -line with [w] given its current association with [n|m]. =C2=A0"Record= Mode [w]" has some merit since that is at least the use case that I h= ave identified where it is particularly useful...

David J.
=E2=80=8B

--047d7bacbdaaa9f86404fb1fffe8--