non-greedy reg-ex broken?
-
- Posts: 269
- Joined: Sat Jul 10, 2010 4:03 pm
non-greedy reg-ex broken?
Linux 14.1 Editor
using "<emph ana="italic">(.*?)</emph>," as my search string
on
"<emph ana="italic">is</emph> my sister? so I might have taken her to me to wife: now therefore behold thy wife, take <emph ana="italic">her</emph>,"
I would expect it to return "<emph ana="italic">her</emph>," as the result.
However it returns the entire string? That seems wrong?
Scott
-
- Posts: 2879
- Joined: Tue May 17, 2005 4:01 pm
Re: non-greedy reg-ex broken?
Hello,
No, it's actually correct.
I believe you are interpreting the non-greedy reg-ex as a "shortest match" (ignoring other longer matches), but it doesn't work that way. The matcher scans from the start position and returns the first match it encounters. It doesn't skip a match to look for a shorter one further along; non-greedy only means that, from a given starting point, the repetition stops as soon as the rest of the pattern can match.
The greedy match (.*) works similarly, but after finding a match, it tries to extend it to the right as much as possible.
So, in short, this depends a lot on the position where you start the search.
If you start from the left edge, it will match the entire string even with non-greedy. If you move the start position even a character to the right, it will return the last match (as you were expecting).
You're usually better off avoiding the dot (".") in such reg-ex patterns, because it lets the expression match a lot more than you expect (even when non-greedy).
In this case, if you don't want it to match other tags in between the emph tags, you could use "[^<]" (any character other than "<") instead of ".".
e.g.
Code:
<emph ana="italic">([^<]*?)</emph>,
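To see the behaviour described above outside the editor, here is a minimal sketch using Python's re module. It assumes Python handles these particular constructs the same way as the editor's reg-ex engine, which I have not verified; the variable names are only for illustration.

```python
import re

# The searched content from the original post.
text = ('<emph ana="italic">is</emph> my sister? so I might have taken her '
        'to me to wife: now therefore behold thy wife, take '
        '<emph ana="italic">her</emph>,')

# Non-greedy dot, with the trailing comma, as in the original search.
lazy_dot = re.compile(r'<emph ana="italic">(.*?)</emph>,')
# Same pattern, but the content may not contain "<".
no_angle = re.compile(r'<emph ana="italic">([^<]*?)</emph>,')

# Searching from the left edge: the match starts at the first opening tag
# and has to run to the only "</emph>," in the text.
from_start = lazy_dot.search(text)

# Searching from one character to the right: the first opening tag can no
# longer match, so the match starts at the second one.
from_one = lazy_dot.search(text, 1)

# With [^<] the first opening tag cannot reach "</emph>," at all.
restricted = no_angle.search(text)

print(from_start.group(0) == text)
print(from_one.group(0))
print(restricted.group(0))
```

The first search covers the entire string, while the other two return only the last pair of tags followed by the comma.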
Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
-
- Posts: 269
- Joined: Sat Jul 10, 2010 4:03 pm
Re: non-greedy reg-ex broken?
Apparently the whole world (every site I checked) disagrees with your interpretation of non-greedy expressions.
As I stated, a <tag>.*?</tag> should only match the first pair of tags in the string
see
http://www.regular-expressions.info/repeat.html
http://www.itworld.com/nl/perl/01112001
http://www.asiteaboutnothing.net/regex/regex-greed.html
I think, and the world agrees, your reg-ex engine is non-compliant and broken with regard to non-greedy searches.
-
- Posts: 2879
- Joined: Tue May 17, 2005 4:01 pm
Re: non-greedy reg-ex broken?
Hi,
Quote:
Apparently the whole world (every site I checked) disagrees with your interpretation of non-greedy expressions.
As I stated, a <tag>.*?</tag> should only match the first pair of tags in the string
But you didn't search for a simple pair of tags, you searched for a pair of tags followed by a comma (unless that was a spelling error):
Quote:
using "<emph ana="italic">(.*?)</emph>," as my search string
That comma at the end of the reg-ex only appears in the searched content after the last tag:
Quote:
"<emph ana="italic">is</emph> my sister? so I might have taken her to me to wife: now therefore behold thy wife, take <emph ana="italic">her</emph>,"
So the reg-ex matcher finds the first "<emph ana="italic">" at the beginning and the only "</emph>," (note the comma) at the end of the searched content.
However, if you search for a pair of tags:
Code:
<emph ana="italic">(.*?)</emph>
without the trailing comma, it will find the first pair: "<emph ana="italic">is</emph>" and if you search again, it will find the second pair: "<emph ana="italic">her</emph>"
That comma from the reg-ex makes a big difference in this regard.
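The difference the comma makes can be checked with a short sketch in Python's re module. This again assumes Python treats these constructs the same way as the editor's engine; finditer stands in for pressing "find" repeatedly, and the names are only illustrative.

```python
import re

# The searched content from the original post.
text = ('<emph ana="italic">is</emph> my sister? so I might have taken her '
        'to me to wife: now therefore behold thy wife, take '
        '<emph ana="italic">her</emph>,')

# Pattern without the trailing comma: a simple pair of tags.
pair = re.compile(r'<emph ana="italic">(.*?)</emph>')
# Pattern with the trailing comma, as in the original search.
pair_comma = re.compile(r'<emph ana="italic">(.*?)</emph>,')

# Successive searches, each continuing where the previous match ended.
pairs = [m.group(0) for m in pair.finditer(text)]
with_comma = [m.group(0) for m in pair_comma.finditer(text)]

print(pairs)
print(len(with_comma), with_comma[0] == text)
```

Without the comma there are two separate matches, one per pair of tags. With the comma there is a single match spanning the whole string.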
Regards,
Adrian