					
				Regular expressions discrepancy
				Posted: Fri Oct 30, 2015 1:29 am
				by whyme
				Take the string "|". When I test that string against \W in an XPath expression (say, in a stylesheet), there is no match. But there is a match when I search for \W in the oXygen Find/Replace dialog box (i.e. Find: \W).
As far as I can tell from the official definition, 
http://www.w3.org/TR/xmlschema-2/#charcter-classes, the XPath is right, since U+007C is tagged as Sm, which is not excluded from the class \w. 
I would think that the oXygen search mechanism is wrong, or else the departure from the W3C definition is intentional, but not documented in the right place, i.e., 
http://www.oxygenxml.com/doc/versions/1 ... sions.html. Or am I off somewhere?
 
			 
			
					
				Re: Regular expressions discrepancy
				Posted: Fri Oct 30, 2015 5:34 pm
				by adrian
				Hi,
Yes, that is correct: the regular expression syntax accepted in XPath/XML Schema/Schematron is slightly different from the one Oxygen uses for text searches in the Find/Replace dialogs.
In the Find/Replace dialogs, Oxygen uses Java regular expression syntax, which is based on Perl 5 regex, with some differences:
https://www.oxygenxml.com/doc/versions/ ... sions.html
Follow the link at the bottom of that page and you should find the definition of \w and \W:
http://docs.oracle.com/javase/6/docs/ap ... tml#predef
\w 	A word character: [a-zA-Z_0-9]
\W 	A non-word character: [^\w]
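To see the Java side concretely, here is a minimal sketch using java.util.regex with its default settings (no UNICODE_CHARACTER_CLASS flag). The class and method names are made up for illustration, and this only shows plain Java behavior; it is not Oxygen's actual search code.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of Oxygen.
public class JavaNonWordDemo {
    // With default flags, Java treats \W as [^a-zA-Z_0-9].
    private static final Pattern JAVA_NON_WORD = Pattern.compile("\\W");

    // Returns true if the text contains at least one character matched by \W.
    public static boolean containsJavaNonWord(String text) {
        return JAVA_NON_WORD.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsJavaNonWord("|"));      // true: not in [a-zA-Z_0-9]
        System.out.println(containsJavaNonWord("a_1"));    // false: only ASCII word characters
        System.out.println(containsJavaNonWord("\u00e9")); // true: non-ASCII letter
    }
}
```

The last call hints at a second difference: under the default Java definition, accented letters count as non-word characters, while under the XML Schema definition they are word characters.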
The regular expression syntax from XPath functions is also based on Perl syntax, but it uses the XML Schema regex as the base, so there are some differences:
http://www.w3.org/TR/xpath-functions/#regex-syntax
This is the XML Schema "Regular Expressions" glossary:
http://www.w3.org/TR/xmlschema-2/#dt-ccesN
(look above the G Glossary for the definition of \W and \w)
\w	[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)
\W	[^\w]
So, in short, \w and \W are significantly different in the two implementations. If you want consistent results between the two, use [\p{P}\p{Z}\p{C}] in Oxygen instead of \W, and [^\p{P}\p{Z}\p{C}] instead of \w.
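As a rough check of that substitution, the sketch below spells out the XML Schema definition of \W with Unicode category classes in java.util.regex. The class and method names are hypothetical, and it approximates the XML Schema definition inside Java rather than calling any XPath processor.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of Oxygen or any XPath processor.
public class XsdNonWordDemo {
    // The XML Schema definition of \W written out as Unicode categories:
    // punctuation (P), separators (Z) and "other" (C).
    private static final Pattern XSD_STYLE_NON_WORD =
            Pattern.compile("[\\p{P}\\p{Z}\\p{C}]");

    // Returns true if the text contains a character from the P, Z or C categories.
    public static boolean containsXsdStyleNonWord(String text) {
        return XSD_STYLE_NON_WORD.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsXsdStyleNonWord("|")); // false: U+007C is Sm (math symbol)
        System.out.println(containsXsdStyleNonWord(",")); // true: Po (other punctuation)
        System.out.println(containsXsdStyleNonWord(" ")); // true: Zs (space separator)
        System.out.println(containsXsdStyleNonWord("_")); // true: Pc (connector punctuation)
    }
}
```

Note that the underscore flips as well: Java's \w includes it, while the XML Schema \w excludes it because it is punctuation.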
Regards,
Adrian
 
			 
			
					
				Re: Regular expressions discrepancy
				Posted: Mon Nov 02, 2015 5:28 pm
				by whyme
				Thank you for the thorough background. Would you be willing, next time you update the documentation, to include a modified form of this discussion in the material about regular expressions? The escape classes \w and \W are widely used, and a bit more prominence to the issue in oXygen documentation would be helpful to users. Thanks!
			 
			
					
				Re: Regular expressions discrepancy
				Posted: Mon Nov 02, 2015 5:46 pm
				by adrian
				Hi,
I've already submitted an issue for our documentation department to include this in the manual. I forgot to mention this in my previous post.
Regards,
Adrian