Regular expressions discrepancy

whyme
Posts: 89
Joined: Fri Mar 08, 2013 8:58 am

Post by whyme »

Take the string "|". When I test it against \W in an XPath expression (say, in a stylesheet), there is no match. But there is a match when I search in the oXygen Find/Replace dialog box (i.e. Find: \W).

As far as I can tell from the official definition, http://www.w3.org/TR/xmlschema-2/#charcter-classes, the XPath is right, since U+007C is tagged as Sm, which is not excluded from the class \w.
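For anyone who wants to check the character data themselves, here is a minimal sketch using Python's standard unicodedata module (Python is just a convenient checker here; it is not part of oXygen or XPath). It prints the code point, general category and name of "|".

```python
import unicodedata

ch = "|"

# Hexadecimal code point, formatted the usual U+XXXX way.
code_point = f"U+{ord(ch):04X}"

# Two-letter Unicode general category (e.g. Sm = Symbol, math).
category = unicodedata.category(ch)

# Official Unicode character name.
name = unicodedata.name(ch)

print(code_point, category, name)  # U+007C Sm VERTICAL LINE
```

Since Sm belongs to none of the P, Z or C categories, the character falls inside the XML Schema \w class.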

I would think that the oXygen search mechanism is wrong, or else the departure from the W3C definition is intentional, but not documented in the right place, i.e., http://www.oxygenxml.com/doc/versions/1 ... sions.html. Or am I off somewhere?
adrian
Posts: 2855
Joined: Tue May 17, 2005 4:01 pm

Re: Regular expressions discrepancy

Post by adrian »

Hi,

Yes, that is correct: the regular expression syntax accepted in XPath/XML Schema/Schematron is slightly different from the one Oxygen uses for text searches in the Find/Replace dialogs.
In the Find/Replace dialogs, Oxygen uses Java regular expression syntax, which is based on Perl 5 regex, with some differences:
https://www.oxygenxml.com/doc/versions/ ... sions.html
Follow the link at the bottom of that page and you should find the definition of \w and \W:
http://docs.oracle.com/javase/6/docs/ap ... tml#predef

Code:

\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]
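To see what this definition means for the "|" case, here is a small sketch that spells out the quoted class literally in Python. It is an emulation of the documented Java definition, not Java or Oxygen itself, and the names JAVA_STYLE_NON_WORD and java_style_non_word are made up for illustration.

```python
import re

# The quoted Java definition: \w is [a-zA-Z_0-9], so \W is its complement.
# The class is written out explicitly so the result does not depend on
# how Python itself interprets \w.
JAVA_STYLE_NON_WORD = re.compile(r"[^a-zA-Z_0-9]")


def java_style_non_word(text):
    # Return every character that the Java-style \W would match.
    return JAVA_STYLE_NON_WORD.findall(text)


print(java_style_non_word("a|b"))  # ['|']
print(java_style_non_word("a_b"))  # []
```

Under this definition anything outside the ASCII letters, digits and underscore counts as a non-word character, which is why "|" is found in the Find/Replace dialog.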
The regular expression syntax from XPath functions is also based on Perl syntax, but it uses the XML Schema regex as the base, so there are some differences:
http://www.w3.org/TR/xpath-functions/#regex-syntax
This is the XML Schema "Regular Expressions" glossary:
http://www.w3.org/TR/xmlschema-2/#dt-ccesN
(look above the G Glossary for the definition of \W and \w)

Code:

\w    [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)
\W    [^\w]
So, in short, \w and \W are significantly different in the two implementations. If you want consistent results between the two, use [\p{P}\p{Z}\p{C}] instead of \W (and [^\p{P}\p{Z}\p{C}] instead of \w) in the Oxygen Find/Replace dialogs.
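The XML Schema definition can be sketched the same way. Python's built-in re module has no \p{...} support, so this emulation looks up the Unicode general category directly. The helper names xsd_is_non_word and xsd_style_non_word are hypothetical, and the Unicode data shipped with Python may differ in edge cases from what a given XPath processor or Java version uses.

```python
import unicodedata


def xsd_is_non_word(ch):
    # XML Schema \W: any character whose general category is
    # punctuation (P*), separator (Z*) or other (C*).
    return unicodedata.category(ch)[0] in "PZC"


def xsd_style_non_word(text):
    # Return every character that the XML Schema-style \W would match,
    # which is also what [\p{P}\p{Z}\p{C}] is meant to select.
    return [ch for ch in text if xsd_is_non_word(ch)]


print(xsd_style_non_word("a|b"))      # []
print(xsd_style_non_word("a|b_c d"))  # ['_', ' ']
```

Note that the difference runs both ways: "|" (Sm) is a word character here, while "_" (Pc) is not, the opposite of the Java definition.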

Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
whyme

Re: Regular expressions discrepancy

Post by whyme »

Thank you for the thorough background. Would you be willing, next time you update the documentation, to include a modified form of this discussion in the material about regular expressions? The escape classes \w and \W are widely used, and a bit more prominence to the issue in oXygen documentation would be helpful to users. Thanks!
adrian

Re: Regular expressions discrepancy

Post by adrian »

Hi,

I've already submitted an issue for our documentation department to include this in the manual. I forgot to mention this in my previous post.

Regards,
Adrian