Bug in XPath 2.0 regular expression search

Post by **dsewell** » Fri Jul 29, 2005 3:57 am

Consider this XML document:

<?xml version="1.0" encoding="UTF-8"?>
<test>
   <string>A dog</string>
   <string>a cat</string>
</test>

I want to match the text node containing ASCII 'A' using an XPath 2.0 expression in the oXygen 6.1 XPath search box. This works:

//text()[matches(., "A")]

but this returns 0 results:

//text()[matches(., "&#65;")]

even though it is semantically identical according to XPath 2 / XML schema regular expression rules.

Instead, the second search in oXygen matches this:

<string>&amp;#65; frog</string>

i.e. a string with an ampersand character reference followed by literal "#65;". It seems that oXygen is not correctly parsing the character reference before doing the regular expression match.
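To make the behaviour described above concrete, here is a rough sketch in Python rather than in oXygen itself. It assumes the search box hands the typed string to the regex engine unchanged, and it uses Python's `re` module as a stand-in for the XPath 2.0 `matches()` function (the two regex dialects differ, but not for these simple patterns). The helper name `search` is made up for illustration.

```python
import re
from xml.etree import ElementTree as ET

# The test document plus the "frog" element. In the source, the ampersand is
# written as &amp;, so the parsed text node literally holds "&#65; frog".
doc = ET.fromstring(
    "<test><string>A dog</string><string>a cat</string>"
    "<string>&amp;#65; frog</string></test>"
)
texts = [s.text for s in doc.iter("string")]


def search(pattern, nodes):
    """Hypothetical helper: return the text nodes that the pattern matches."""
    return [t for t in nodes if re.search(pattern, t)]


# Pattern taken verbatim from the input field, with no reference decoding.
raw_hits = search("&#65;", texts)

# Pattern after the character reference has been decoded to "A".
decoded_hits = search("A", texts)

print(raw_hits)
print(decoded_hits)
```

Under that assumption the undecoded pattern finds only the "frog" text node, while the decoded pattern finds only "A dog", which is the result I expected from the second search.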

I don't know if oXygen passes an XPath 2.0 search to the Saxon 8 engine, but if so, the problem is not with Saxon 8, because it performs correctly given this XQuery using the same matches() call:

let $xml :=
<test>
   <string>A dog</string>
   <string>a cat</string>
</test>
for $n in $xml//string[matches(., "&#65;")]
return $n

(: returns <string>A dog</string> :)

Post by **george** » Fri Jul 29, 2005 6:05 pm

Hi David,

The XPath entry box is a plain input field, and we take its content as the XPath expression to be executed. That is, we do not treat it as an XML fragment, so no decoding of entities is performed.
Do you have problems with this?

Best Regards,
George

Post by **dsewell** » Fri Jul 29, 2005 6:52 pm

Well, I think that technically according to the XPath specification (even for XPath 1.0) a string is a sequence of characters as defined in the XML specification (see http://www.w3.org/TR/xpath#strings), so that in fact an XPath parser should treat these as identical:

contains("CAT", "A")   =  contains("CAT", "&#65;")

contains("dog's", "'")  = contains("dog's", "&apos;")

So the current oXygen XPath search is not fully implementing the XPath string model, as I understand it.

For my personal work it's not a big issue because I can almost always directly input a UTF-8 character. But I discovered this bug when I was documenting a procedure for general use. Specifically, I was sharing a method in oXygen for searching for the Unicode en dash (–). It is preferable to use a numeric character reference like "contains($string, '&#8211;')" because it is too easy to confuse the en-dash character with a hyphen. So I do think it would be worth adding support for character entity references in the search field.

Post by **george** » Fri Jul 29, 2005 7:23 pm

Hi David,

Not really, the XPath is in general placed in an attribute value and it is the XML Parser that decodes the entity. But we will consider anyway adding an option that when enabled will produce decoding of the standard XML entities &lt;, &gt;, &apos;, &quot; and &amp; and of the character entities from the XPath entry field.
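As a rough illustration of both points, here is a Python sketch. It is not oXygen's code. The first part shows an XML parser decoding a character reference inside an attribute value before any XPath engine sees it. The second part is a guess at what the optional decoding step for the entry field could look like. The element name `if`, the attribute name `test`, and the function `decode_references` are all invented for the example.

```python
import re
from xml.etree import ElementTree as ET

# 1) XPath inside an attribute: the XML parser expands &#65; on its own.
elem = ET.fromstring("<if test=\"contains(., '&#65;')\"/>")
attr_value = elem.get("test")

# 2) A possible decoding step for a plain input field.
# The five predefined XML entities.
PREDEFINED = {"lt": "<", "gt": ">", "apos": "'", "quot": '"', "amp": "&"}

# Hex reference, decimal reference, or a named entity.
_REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def decode_references(expr: str) -> str:
    """Hypothetical helper: expand XML references in a search-box string."""

    def expand(match):
        body = match.group(1)
        if body.startswith("#x"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
        # Names other than the five predefined ones are left untouched.
        return PREDEFINED.get(body, match.group(0))

    # A single pass, so "&amp;#65;" becomes "&#65;" and is not decoded twice.
    return _REFERENCE.sub(expand, expr)


print(attr_value)
print(decode_references('//text()[matches(., "&#65;")]'))
```

With the option enabled, the expression typed into the field would be passed through something like `decode_references` before evaluation. With it disabled, the string would go to the XPath engine unchanged, as it does now.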

Best Regards,
George

Post by **dsewell** » Sat Jul 30, 2005 2:37 am

george wrote: we will consider anyway adding an option that when enabled will produce decoding of the standard XML entities &lt;, &gt;, &apos;, &quot; and &amp; and of the character entities from the XPath entry field.

If that's not too much trouble, I think it would be a helpful feature,

David
