non-greedy reg-ex broken?

Posted: Thu Mar 21, 2013 12:12 am
by sderrick
Linux 14.1 Editor

using "<emph ana="italic">(.*?)</emph>," as my search string

on

"<emph ana="italic">is</emph> my sister? so I might have taken her to me to wife: now therefore behold thy wife, take <emph ana="italic">her</emph>,"

I would expect it to return "<emph ana="italic">her</emph>," as the result.

However it returns the entire string? That seems wrong?

Scott

Re: non-greedy reg-ex broken?

Posted: Thu Mar 21, 2013 11:38 am
by adrian
Hello,

No, it's actually correct.
I believe you are interpreting the non-greedy reg-ex as a "shortest match" (ignoring other, longer matches), but it doesn't work that way. The matcher takes the first position from the left where a match can begin and, from there, lets .*? consume only as much as is needed for the rest of the expression to match. It doesn't skip a possible match in order to find a shorter one further to the right.
The greedy match (.*) starts from the same position, but it extends the match to the right as far as possible.

So, in short, this depends a lot on the position where you start the search.
If you start from the left edge, it will match the entire string even with non-greedy. If you move the start position even a character to the right, it will return the last match (as you were expecting).
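
For anyone who wants to check this outside the editor, here is a minimal sketch in Python. It assumes Python's `re` module treats this pattern the same way the editor's engine does, which should hold for backtracking engines in general but is an assumption here. The variable names are only for illustration.

```python
import re

# The sample text from the first post.
text = ('<emph ana="italic">is</emph> my sister? so I might have taken her '
        'to me to wife: now therefore behold thy wife, take '
        '<emph ana="italic">her</emph>,')

# The original search string, including the trailing comma.
pattern = re.compile(r'<emph ana="italic">(.*?)</emph>,')

# Searching from the left edge: the match starts at the first opening tag
# and has to run to the only "</emph>," in the text.
from_start = pattern.search(text)

# Searching from one character to the right: the first opening tag can no
# longer be matched, so the match starts at the second one.
from_offset = pattern.search(text, 1)

print(from_start.group(0) == text)   # True
print(from_offset.group(0))          # <emph ana="italic">her</emph>,
```

The second argument to `search` is the start position, which plays the role of the caret position in the editor.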

You're usually better off avoiding the dot (".") in such regular expressions, because it can match a lot more than you expect (even when non-greedy).

In this case, if you don't want it to match other tags in between the emph tags, you could use "[^<]" (any character other than "<") instead of ".".
e.g.

<emph ana="italic">([^<]*?)</emph>,
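
As a rough check of the expression above, here is a small Python sketch. Python's `re` module stands in for the editor's engine, which is an assumption, and the variable names are illustrative only.

```python
import re

# The sample text from the first post.
text = ('<emph ana="italic">is</emph> my sister? so I might have taken her '
        'to me to wife: now therefore behold thy wife, take '
        '<emph ana="italic">her</emph>,')

# "[^<]" cannot cross a "<", so the group can never span another tag.
strict = re.compile(r'<emph ana="italic">([^<]*?)</emph>,')

# Searching from the left edge: the first pair is rejected because it is
# not followed by a comma, and the search moves on to the second pair.
match = strict.search(text)

print(match.group(0))  # <emph ana="italic">her</emph>,
print(match.group(1))  # her
```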
Regards,
Adrian

Re: non-greedy reg-ex broken?

Posted: Thu Mar 21, 2013 4:51 pm
by sderrick
Apparently the whole world (every site I checked) disagrees with your interpretation of non-greedy expressions.

As I stated, a <tag>.*?</tag> should only match the first pair of tags in the string

see

http://www.regular-expressions.info/repeat.html

http://www.itworld.com/nl/perl/01112001

http://www.asiteaboutnothing.net/regex/regex-greed.html

I think, and the world agrees, your reg-ex engine is non-compliant and broken in regards to non-greedy searches.

Re: non-greedy reg-ex broken?

Posted: Thu Mar 21, 2013 5:22 pm
by adrian
Hi,
sderrick wrote:
Apparently the whole world (every site I checked) disagrees with your interpretation of non-greedy expressions.
As I stated, a <tag>.*?</tag> should only match the first pair of tags in the string

But you didn't search for a simple pair of tags, you searched for a pair of tags followed by a comma (unless that was a spelling error):

sderrick wrote:
using "<emph ana="italic">(.*?)</emph>," as my search string

That comma at the end of the reg-ex only appears in the searched content after the last tag:

"<emph ana="italic">is</emph> my sister? so I might have taken her to me to wife: now therefore behold thy wife, take <emph ana="italic">her</emph>,"

So the reg-ex matcher finds the first "<emph ana="italic">" at the beginning and the only "</emph>," (note the comma) at the end of the searched content.

However, if you search for a pair of tags:

<emph ana="italic">(.*?)</emph>
without the trailing comma, it will find the first pair: "<emph ana="italic">is</emph>" and if you search again, it will find the second pair: "<emph ana="italic">her</emph>"

That comma from the reg-ex makes a big difference in this regard.
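
To see the two behaviours side by side, here is a short Python sketch. It uses Python's `re` module as a stand-in for the editor's engine, which is an assumption, and the names are illustrative.

```python
import re

# The sample text from the first post.
text = ('<emph ana="italic">is</emph> my sister? so I might have taken her '
        'to me to wife: now therefore behold thy wife, take '
        '<emph ana="italic">her</emph>,')

# Without the trailing comma, each pair of tags is found in turn.
pairs = re.findall(r'<emph ana="italic">(.*?)</emph>', text)

# With the trailing comma, the only possible end of the match is the
# "</emph>," at the very end, so the match covers the whole text.
with_comma = re.search(r'<emph ana="italic">(.*?)</emph>,', text)

print(pairs)                          # ['is', 'her']
print(with_comma.group(0) == text)    # True
```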

Regards,
Adrian

Re: non-greedy reg-ex broken?

Posted: Fri Mar 22, 2013 6:20 pm
by sderrick
Got it! Sorry for being such a dufus!