Some qualms with the current description of RegExp s,n,w modes.


* Some qualms with the current description of RegExp s,n,w modes.
@ 2014-06-05 23:34  David G Johnston <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: David G Johnston @ 2014-06-05 23:34 UTC (permalink / raw)
  To: pgsql-docs

The current documentation for "n" and "w" is as follows:

[s] If partial newline-sensitive matching is specified, this affects . and
bracket expressions as with newline-sensitive matching, but not ^ and $.

[w] If inverse partial newline-sensitive matching is specified, this affects
^ and $ as with newline-sensitive matching, but not . and bracket
expressions. This isn't very useful but is provided for symmetry.
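
To make the two quoted descriptions concrete, here is a minimal sketch that
uses Python's re module as a stand-in for the PostgreSQL engine.  The flag
mapping is an assumption made for illustration only: "partial" is modeled as
no flags at all, and "inverse partial" as DOTALL plus MULTILINE.  Python
bracket expressions such as [^x] match a newline regardless of flags, so only
the "." half of the description is reproduced.  The helper names are
hypothetical.

```python
import re

# Assumed Python analogues; this is not PostgreSQL's regex engine.
PARTIAL = 0                                 # . stops at newline; ^ and $ are document-level
INVERSE_PARTIAL = re.DOTALL | re.MULTILINE  # . crosses newlines; ^ and $ are line-level

text = "one\ntwo"


def dot_crosses_newline(flags):
    """True if '.' is allowed to consume the newline between 'one' and 'two'."""
    return re.search(r"e.t", text, flags) is not None


def line_anchored_words(flags):
    """Words that ^ and $ can isolate; empty unless the anchors work per line."""
    return re.findall(r"^\w+$", text, flags)


for name, flags in (("partial", PARTIAL), ("inverse partial", INVERSE_PARTIAL)):
    print(name, dot_crosses_newline(flags), line_anchored_words(flags))
```

Under these assumptions the partial analogue reports False and an empty list,
while the inverse partial analogue reports True and both words.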

I have a specific qualm with the claim that [w] "isn't very useful".  I
would argue that, for a person who is appropriately exact in their usage of
\A and \Z, there is nothing [s] can do that cannot be done in [w], and that
parsing multi-record text documents becomes much cleaner in [w] mode.  The
terms themselves also do little to help the user understand and remember the
nuances of each mode.

I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
"anchors", though I did make use of ^ and $ individually quite a bit.  I did
not formally define these terms in the body either.

I'm posting mostly to see if anyone else agrees with my opinions on the
matter and to gather thoughts both for and against.

Note that true symmetry would require a 4th mode - one where wildcards stop
at newlines but where anchors only match at the document level - though this
pair is of little value for much the same reason as [n].  In my mind there
are two primary modes (s, w) and one "helpful" mode (n) - no symmetry
claimed.

Instead of calling these "partial" and "inverse partial" better terms would
be "newline-sensitive wildcard matching" and "newline-sensitive anchor
matching".  The default mode could be called "newline-sensitive full
matching".  With those defined correctly elsewhere in the documentation
section 9.7.3.5 (9.3 version) could provide the following definitions:

full matching - the default - causes wildcards to stop matching at a newline
(typically denoting end-of-line) and so is often referred to as single-line
mode.  The beginning and end of each line can be referred to by using ^ and
$ respectively.  During a global match the document boundaries can be
matched  using \A and \Z.

anchor-only matching is generally useful and almost necessary for times when
newlines are not part of the content but the document being parsed has
multiple records separated by newlines (in particular if the number of
rows-per-record is variable).  The wildcard allows for selecting multiple
rows of content from each record while still being able to use the anchors
to find the beginning and end of each record.  Like in full matching mode
the document boundaries can be matched using \A and \Z.

wildcard-only matching is useful when you wish to treat newlines only as
content within a single logical document.  ^ and $ are left as synonyms for
\A and \Z respectively and so do not (typically inadvertently) match near an
embedded newline - you have to use a literal \n to do that and then deal
with the newline itself being part of the capture.  This is best thought of
as a compatibility mode since you can get the same behavior, without losing
the unique behavior of ^ and $, in anchor-only mode with proper use of \A
and \Z to match boundaries and avoid using ^ and $.
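
A minimal sketch of that last claim, again using Python's re module as an
assumed stand-in rather than the PostgreSQL engine: WILDCARD_ONLY and
ANCHOR_ONLY are hypothetical names for the two behaviors described above
(DOTALL alone, and DOTALL plus MULTILINE).

```python
import re

# Assumed Python analogues of the two modes described above.
WILDCARD_ONLY = re.DOTALL                # ^ and $ behave like \A and \Z
ANCHOR_ONLY = re.DOTALL | re.MULTILINE   # ^ and $ also match at line breaks

doc = "alpha\nbeta\ngamma"

# Whole-document match written with ^ and $ in wildcard-only mode ...
whole_with_carets = re.findall(r"^.*$", doc, WILDCARD_ONLY)

# ... and the same match in anchor-only mode, written with \A and \Z instead.
whole_with_az = re.findall(r"\A.*\Z", doc, ANCHOR_ONLY)

# Anchor-only mode additionally keeps ^ and $ available for per-line work.
lines = re.findall(r"^.*?$", doc, ANCHOR_ONLY)

print(whole_with_carets == whole_with_az == [doc])
print(lines)
```

Both whole-document forms return the same single match here, and only the
anchor-only analogue can also split the text into its three lines.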

David J.

-- 
Sent via pgsql-docs mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-docs

* Re: Some qualms with the current description of RegExp s,n,w modes.
@ 2014-06-06 00:00  Tom Lane <[email protected]>
  parent: David G Johnston <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Tom Lane @ 2014-06-06 00:00 UTC (permalink / raw)
  To: David G Johnston <[email protected]>; +Cc: pgsql-docs

David G Johnston <[email protected]> writes:
> I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
> "anchors" though did make use of ^ and $ individually quite a bit.  I did
> not formally define these terms in the body either.

Did you mean to attach a proposed doc patch here, or are you just
armwaving about what a patch might look like?

FWIW, I don't agree with using "wildcard" to mean those particular things
(the term is too generic, and there are other regex constructs that
might be thought to be included); although you could probably get away
with using "anchor" this way as long as you define the term at first use.

The text involved here is more or less verbatim from Henry Spencer's
original man page for the regex library, so you're essentially claiming
you know more than the author did about what his code is good for.  Maybe
so, but some examples in support of your thesis would be a good thing.

> Instead of calling these "partial" and "inverse partial" better terms would
> be "newline-sensitive wildcard matching" and "newline-sensitive anchor
> matching".

Agreed that "partial" is not a very good name, but I remain resistant to
"wildcard" here.

> The default mode could be called "newline-sensitive full
> matching".

Or just "newline-sensitive matching" ... does "full" add anything?

			regards, tom lane

* Re: Some qualms with the current description of RegExp s,n,w modes.
@ 2014-06-06 00:32  David Johnston <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: David Johnston @ 2014-06-06 00:32 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: pgsql-docs

On Thu, Jun 5, 2014 at 8:00 PM, Tom Lane <[email protected]> wrote:

> David G Johnston <[email protected]> writes:
> > I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
> > "anchors" though did make use of ^ and $ individually quite a bit.  I did
> > not formally define these terms in the body either.
>
> Did you mean to attach a proposed doc patch here, or are you just
> armwaving about what a patch might look like?
>

Armwaving for lack of any current setup to generate doc-patches.

> FWIW, I don't agree with using "wildcard" to mean those particular things
> (the term is too generic, and there are other regex constructs that
> might be thought to be included); although you could probably get away
> with using "anchor" this way as long as you define the term at first use.
>
>
I had the same nagging suspicion but figured for a first pass, and defined
only within this context, it would suffice.  ". and ^ brackets" just rubbed
me the wrong way but it does have the merit of being precise.

> The text involved here is more or less verbatim from Henry Spencer's
> original man page for the regex library, so you're essentially claiming
> you know more than the author did about what his code is good for.  Maybe
> so, but some examples in support of your thesis would be a good thing.
>

I can readily support why I found [w] to be most useful; the conclusion
that [w] > [s] came from the following logic: since [s] makes "^ and $"
redundant with \A and \Z, using [w] mode and simply avoiding them has the
same effect.  I'll admit that people using ^ and $ where they really meant
\A and \Z may be an issue worth accounting for...but I personally consider
providing that mode to be a compatibility/help-oriented decision and just
decided to state so in my revision.

Example that prompted this whole journey:

WITH src (filecontent) AS ( VALUES(
$$CDF      CORR: DRAIN COOLANT AND REFILL
         ADDITIONAL DLR-OP: BGFLDEX
         PAY TYPE: C         OTH HRS: 0000    FORECAST SERVICE:      CHG TO:                 EPA CHG:           HAZ CHG:
         9999     5
         SPG CONVERSION SETTINGS - SPG MFG: --  GEN MOD: --  VIN/MODEL#:
            ENGINE:

CDR      CORR: CUSTOMER ELECTED NOT TO HAVE REPAIRS DONE AT THIS TIME
         NOS
         PAY TYPE: C         OTH HRS: 0000    FORECAST SERVICE:      CHG TO:                 EPA CHG:           HAZ CHG:
         9999                                               03 0030
         SPG CONVERSION SETTINGS - SPG MFG: --  GEN MOD: --  VIN/MODEL#:
            ENGINE:
$$::varchar
))
-- A record starts at a line whose first character is not whitespace and
-- runs until the next such line or the end of the document.
, do_match AS (
SELECT regexp_matches(filecontent,'^(\S.*?)(?=^\S|\Z)','gw') AS match FROM src
)
-- One row per captured record.
, explode_match AS (
SELECT unnest(match) FROM do_match
)
SELECT unnest, length(unnest) FROM explode_match;

[s] 1 result because the "^\S" construct attempts to match
beginning-of-document instead of beginning-of-line.  This is when I started
digging deeper since I expected it to behave like [w].
[n] 0 results because the (.*?) never gets beyond the first line and thus
cannot match "^\S|\Z" - no problem here, the behavior of "." is as expected.
[w] 2 results as desired/expected.  It is possible to replace ^\S with \n\S
(and thus allow [s] to work) but the semantic meaning of ^ makes using this
form more convenient.
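
The three counts can be reproduced in miniature with the sketch below.  It
is not the PostgreSQL engine: it assumes Python's re flags as analogues
([s] as DOTALL, [n] as MULTILINE, [w] as both), and DOC is a hypothetical
two-record stand-in for the report, with record headers in column 0 and
indented continuation rows.

```python
import re

# Same expression as in the query above.
PATTERN = r"^(\S.*?)(?=^\S|\Z)"

# Hypothetical miniature document: records differ in their number of rows.
DOC = (
    "REC1 first record\n"
    "     detail a\n"
    "     detail b\n"
    "REC2 second record\n"
    "     detail c\n"
)

# Assumed mapping from the mode letters to Python flags.
MODES = {
    "s": re.DOTALL,                 # . crosses newlines; ^ is document-level
    "n": re.MULTILINE,              # . stops at newlines; ^ is line-level
    "w": re.DOTALL | re.MULTILINE,  # . crosses newlines; ^ is line-level
}


def records(mode):
    """Return the records captured by PATTERN under the given mode letter."""
    return re.findall(PATTERN, DOC, MODES[mode])


for mode in MODES:
    print(mode, len(records(mode)))
```

With these assumptions the loop prints 1 for s, 0 for n and 2 for w, which
matches the counts listed above.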

Note that CDF has 5 rows of content while CDR only has 4; thus strongly
suggesting the use of newline-insensitive "wildcard" matching.  The choice
of anchor mode is of a cosmetic/semantic nature but I argue that in this
situation the semantics of [w] are preferred over [n].

In either case I'd rather simply drop the existing commentary that [w] is
not that useful and either in words or example explain when it would have
use; even if you do not want to go as far as to claim that [w] is superior
to [n] as I would.

While it is likely possible to write a working expression in all three
modes, my experience - which is largely based on executing these expressions
in Java, not PostgreSQL, though that is becoming more common nowadays - led
me directly to the regexp provided.

> > Instead of calling these "partial" and "inverse partial" better terms
> > would be "newline-sensitive wildcard matching" and "newline-sensitive
> > anchor matching".
>
> Agreed that "partial" is not a very good name, but I remain resistant to
> "wildcard" here.
>
> > The default mode could be called "newline-sensitive full
> > matching".
>
> Or just "newline-sensitive matching" ... does "full" add anything?
>
>
Not much - though after adding "anchor" and "wildcard" to the others the
question became: if this option is not just one of those, is it both or
neither?  "Full" makes it clear that it means both.

Maybe something like: [s] - single-line mode; [w] - multi-line mode; [n|m]
- document-only mode; though I dislike re-associating multi-line with [w]
given its current association with [n|m].  "Record Mode [w]" has some merit
since that is at least the use case that I have identified where it is
particularly useful...

David J.

* Re: Some qualms with the current description of RegExp s,n,w modes.
@ 2014-06-06 00:56  David Johnston <[email protected]>
  parent: David Johnston <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: David Johnston @ 2014-06-06 00:56 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: pgsql-docs

>
>
>> Or just "newline-sensitive matching" ... does "full" add anything?
>>
>
And since I'm nit-picking anyway - the word "sensitive" does nothing for
me.  Simply "newline-matching" would be sufficient, ideally.  i.e., Do ".
[^]" and "^$" match the newline character, or not.

[w] anchor newline-matching
[n] dot/inverse-bracket newline-matching
[s] newline-matching

These are precise, what-oriented, names compared to:

[w] record mode
[n] multi-line mode
[s] single-line mode

which are more descriptive, use-oriented, names.

Use of these label sets is not mutually exclusive...

David J.
