non-greedy reg-ex broken?

sderrick
Posts: 211

Thu Mar 21, 2013 12:12 am

Linux 14.1 Editor

using "<emph ana="italic">(.*?)</emph>," as my search string

on

"<emph ana="italic">is</emph> my sister? so I might have taken her to me to wife: now therefore behold thy wife, take <emph ana="italic">her</emph>,"

I would expect it to return "<emph ana="italic">her</emph>," as the result.

However, it returns the entire string. That seems wrong?

Scott
adrian
Posts: 2442

Re: non-greedy reg-ex broken?

Thu Mar 21, 2013 11:38 am

Hello,

No, it's actually correct.
I believe you are interpreting the non-greedy reg-ex as a "shortest match" (ignoring other, longer matches), but it doesn't work that way. The matcher returns the first match it encounters, scanning from left to right: once it finds a position where a match can start, the non-greedy quantifier only extends the match as little as necessary from that starting point. It does not skip an earlier match to look for a shorter one later in the text.
The greedy match (.*) works similarly, but after finding a match, it tries to extend it to the right as far as possible.

So, in short, this depends a lot on the position where you start the search.
If you start from the left edge, it will match the entire string even with non-greedy. If you move the start position even a character to the right, it will return the last match (as you were expecting).
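
The effect of the start position can be reproduced outside the editor. The following is a minimal sketch in Python, assuming that Python's `re` module handles the lazy quantifier the same way as the editor's engine for this pattern; the variable names are illustrative only.

```python
import re

# The searched content from the original post.
text = ('<emph ana="italic">is</emph> my sister? so I might have taken her '
        'to me to wife: now therefore behold thy wife, take '
        '<emph ana="italic">her</emph>,')

# Non-greedy pattern, including the trailing comma.
lazy = re.compile(r'<emph ana="italic">(.*?)</emph>,')

# Search from the left edge: the match starts at the first opening tag
# and has to run to the only "</emph>," in the text.
from_start = lazy.search(text)

# Search from one character to the right: the first opening tag can no
# longer be matched, so the match starts at the second one.
from_offset = lazy.search(text, 1)

print(from_start.group(0) == text)  # the whole string was matched
print(from_offset.group(0))         # only the last tag pair plus comma
```

Here the second argument to `search` is the position at which scanning starts, which stands in for the caret position in the editor.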

You're usually better off avoiding the dot (".") in such regular expressions, because it can match far more than you expect (even when non-greedy).

In this case, if you don't want it to match other tags in between the emph tags, you could use "[^<]" (any character other than "<") instead of ".".
e.g.

<emph ana="italic">([^<]*?)</emph>,
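
As a quick check of this pattern, here is a small Python sketch. It assumes Python's `re` module behaves like the editor's engine for this expression, and the variable names are made up for the example.

```python
import re

# The searched content from the original post.
text = ('<emph ana="italic">is</emph> my sister? so I might have taken her '
        'to me to wife: now therefore behold thy wife, take '
        '<emph ana="italic">her</emph>,')

# "[^<]" cannot cross a "<", so the group can never span another tag.
strict = re.compile(r'<emph ana="italic">([^<]*?)</emph>,')

# Searching from the left edge: the first tag pair is rejected because
# its closing tag is not followed by a comma, and the match cannot be
# extended past the next "<".
match = strict.search(text)

print(match.group(0))  # the matched tag pair plus comma
print(match.group(1))  # the text between the tags
```

With the negated character class, the result no longer depends on where the search starts.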

Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
sderrick
Posts: 211

Re: non-greedy reg-ex broken?

Thu Mar 21, 2013 4:51 pm

Apparently the whole world (every site I checked) disagrees with your interpretation of non-greedy expressions.

As I stated, <tag>.*?</tag> should only match the first pair of tags in the string.

see

http://www.regular-expressions.info/repeat.html

http://www.itworld.com/nl/perl/01112001

http://www.asiteaboutnothing.net/regex/regex-greed.html

I think, and the world agrees, that your reg-ex engine is non-compliant and broken with regard to non-greedy searches.
adrian
Posts: 2442

Re: non-greedy reg-ex broken?

Thu Mar 21, 2013 5:22 pm

Hi,
sderrick wrote:
    Apparently the whole world (every site I checked) disagrees with your interpretation of non-greedy expressions.

    As I stated, <tag>.*?</tag> should only match the first pair of tags in the string

But you didn't search for a simple pair of tags; you searched for a pair of tags followed by a comma (unless that was a typo):

sderrick wrote:
    using "<emph ana="italic">(.*?)</emph>," as my search string

That comma at the end of the reg-ex only appears in the searched content after the last tag:

    "<emph ana="italic">is</emph> my sister? so I might have taken her to me to wife: now therefore behold thy wife, take <emph ana="italic">her</emph>,"

So the reg-ex matcher finds the first "<emph ana="italic">" at the beginning and the only "</emph>," (note the comma) at the end of the searched content.

However, if you search for a pair of tags:

<emph ana="italic">(.*?)</emph>
without the trailing comma, it will find the first pair, "<emph ana="italic">is</emph>", and if you search again, it will find the second pair, "<emph ana="italic">her</emph>".

That comma from the reg-ex makes a big difference in this regard.
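
To see the difference side by side, here is a short Python sketch. It assumes Python's `re` module matches these patterns the same way as the editor's engine, and the names `with_comma` and `without_comma` are just for illustration.

```python
import re

# The searched content from the original post.
text = ('<emph ana="italic">is</emph> my sister? so I might have taken her '
        'to me to wife: now therefore behold thy wife, take '
        '<emph ana="italic">her</emph>,')

with_comma = re.compile(r'<emph ana="italic">(.*?)</emph>,')
without_comma = re.compile(r'<emph ana="italic">(.*?)</emph>')

# With the trailing comma there is a single match covering everything.
comma_matches = [m.group(0) for m in with_comma.finditer(text)]

# Without it, repeated searching yields each tag pair in turn.
pair_contents = [m.group(1) for m in without_comma.finditer(text)]

print(len(comma_matches), comma_matches[0] == text)
print(pair_contents)
```

`finditer` plays the role of pressing "find next" repeatedly in the editor.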

Regards,
Adrian
sderrick
Posts: 211

Re: non-greedy reg-ex broken?

Fri Mar 22, 2013 6:20 pm

Got it! Sorry for being such a doofus!
