X-Original-To: pgsql-general-postgresql.org@localhost.postgresql.org Received: from localhost (unknown [200.46.204.144]) by svr1.postgresql.org (Postfix) with ESMTP id E7C093A43EE for ; Wed, 24 Nov 2004 17:53:46 +0000 (GMT) Received: from svr1.postgresql.org ([200.46.204.71]) by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024) with ESMTP id 89900-07 for ; Wed, 24 Nov 2004 17:53:44 +0000 (GMT) Received: from mail.desc.org (splash195.drizzle.com [216.162.197.195]) by svr1.postgresql.org (Postfix) with ESMTP id 48E793A43D6 for ; Wed, 24 Nov 2004 17:53:43 +0000 (GMT) Received: from [192.168.6.220] (desktop2.desc.org [192.168.6.220]) by mail.desc.org (8.12.8/8.12.8) with ESMTP id iAOHrgUa030209 for ; Wed, 24 Nov 2004 09:53:42 -0800 Message-ID: <41A4CAA5.2040908@desc.org> Date: Wed, 24 Nov 2004 09:53:41 -0800 From: Ken Tanzer Organization: Downtown Emergency Service Center User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040618 X-Accept-Language: en-us, en MIME-Version: 1.0 To: pgsql-general@postgresql.org Subject: Re: Regexp matching: bug or operator error? References: <41A3C6C6.2090605@desc.org> <1713.1101254499@sss.pgh.pa.us> <1846.1101255318@sss.pgh.pa.us> In-Reply-To: <1846.1101255318@sss.pgh.pa.us> Content-Type: multipart/mixed; boundary="------------000206010403020901010307" X-Virus-Scanned: by amavisd-new at hub.org X-Spam-Status: No, hits=0.0 tagged_above=0.0 required=5.0 tests=HTML_MESSAGE X-Spam-Level: X-Archive-Number: 200411/1227 X-Sequence-Number: 69307 This is a multi-part message in MIME format. --------------000206010403020901010307 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Thanks for the quick responses yesterday. At a minimum, it seems like this behavior does not match what is described in the Postgres documentation (more detail below). But I still have a hard time understanding the results of these two queries: select SUBSTRING( 'X444X','.*?([0-9]{1,3}).*?'); This is the original query I submitted, with the puzzling non-greedy match. It returns '4'. Adding start and end characters to the query, like so: select SUBSTRING( 'X444X','^.*?([0-9]{1,3}).*?$'); returns '444'. If the "whole RE" was being set non-greedy by the first ".*?", then shouldn't the subsequent "([0-9]{1,3})" also match non-greedily, returning a '4', with the last ".*?" then capturing the balance of "44X"? Either way, I'm not sure why the start and end characters would effect the rest of the match. In terms of the Postgres documentation, it definitely seems at odds with the observed behavior. Here's my attempts to explain why: a) select SUBSTRING( 'X444X','[0-9]{1,3}'); returns '444'. This suggests that a "default" for the {m,n} syntax is greedy. b) Table 9-13 of the docs describes {m,n} syntax, then lists {m,n}? as a "non-greedy" version of the same. That, and the fact that there doesn't seem to be a specific "greedy" modifier, would both also imply that {m,n} should be greedy. Section 9.6.3.5: "A quantified atom with other normal quantifiers (including {m,n} with m equal to n) prefers longest match" I can't find anything else in this section that would say otherwise. I specifically can't find anything that says the whole expression becomes greedy or not. If the regex code isn't going to change, it seems that changing the documentation would be very helpful to avoid confusion. Of course, that's easy for me to say, since I wouldn't have to do the work! For that matter, I'd be willing to try editing the documentation, but I'd have to understand what the actual behavior is before I could try to describe it! :) Either way, thanks for the great DB program! Ken Tanzer p.s., The suggested regex rewrites some people responded with were appreciated, but the regex I used was just a simplified example for this posting. Here's the actual regex we're working on--any help reformulating this would be great! select substring('Searching for log 5376, referenced in this text' FROM '(?i)(?:.*?)logs?(?:\\s|\\n|
|
| )(?:entry|no|number|#)?(?:\\s|\\n|
|
)?([0-9]{1,7})(.*?)'); We were able to get this to work by adding start and end characters, like so, but it doesn't seem like it should be necessary: select substring('Searching for log 5376, referenced in this text' FROM '(?i)^(?:.*?)logs?(?:\\s|\\n|
|
| )(?:entry|no|number|#)?(?:\\s|\\n\
|
)?([0-9]{1,7})(.*?)$'); Tom Lane wrote: >Checking in the Tcl bug tracker reveals that this is an open issue >for them as well: > >http://sourceforge.net/tracker/index.php?func=detail&aid=219219&group_id=10894&atid=110894 >http://sourceforge.net/tracker/index.php?func=detail&aid=219358&group_id=10894&atid=110894 > >The first entry has Henry Spencer claiming that it is operating as >designed, but the second one seems to cast doubt on that claim. >In any case I tend to agree that the notation implies that greediness >should be an independent property of each quantifier. > >However, if Henry can't or doesn't want to fix it, I'm not sure that >I care to wade in ;-) > > regards, tom lane > > --------------000206010403020901010307 Content-Type: text/x-vcard; charset=utf-8; name="ktanzer.vcf" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="ktanzer.vcf" begin:vcard fn:Kenneth Tanzer n:Tanzer;Kenneth org:Downtown Emergency Service Center;Information Services adr:;;507 Third Avenue;Seattle;WA;98104;USA email;internet:ktanzer@desc.org title:IS Manager tel;work:(206) 464-1570 x 3061 tel;fax:(206) 624-4196 url:http://www.desc.org version:2.1 end:vcard --------------000206010403020901010307--