Bug in XPath 2.0 regular expression search

dsewell
Posts: 125
Joined: Mon Jun 09, 2003 6:02 pm
Location: Charlottesville, Virginia USA

Post by dsewell »

Consider this XML document:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<test>
<string>A dog</string>
<string>a cat</string>
</test>
I want to match the text node containing ASCII 'A' using an XPath 2.0 expression in the oXygen 6.1 XPath search box. This works:

Code:

//text()[matches(., "A")]
but this returns 0 results:

Code:

//text()[matches(., "&#65;")]
even though it is semantically identical according to the XPath 2.0 / XML Schema regular expression rules.

Instead, the second search in oXygen matches this:

Code:

<string>&amp;#65; frog</string>
i.e. a string with an ampersand character reference followed by literal "#65;". It seems that oXygen is not correctly parsing the character reference before doing the regular expression match.

I don't know if oXygen passes an XPath 2.0 search to the Saxon 8 engine, but if so, the problem is not with Saxon 8, because it performs correctly given this XQuery using the same matches() call:

Code:

let $xml := 
<test>
<string>A dog</string>
<string>a cat</string>
</test>
for $n in $xml//string[matches(., "&#65;")]
return $n

(: returns <string>A dog</string> :)
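
The difference between the two behaviours can be reproduced outside oXygen. The following is a minimal Python sketch, not oXygen's or Saxon's actual code. It assumes the search box hands the pattern to the regex engine verbatim, and it uses `html.unescape` as a stand-in for the reference decoding that an XML or XQuery parser would perform. The helper name `matching_strings` is hypothetical.

```python
import html
import re
import xml.etree.ElementTree as ET

# The test document, plus the "&amp;#65; frog" string from the report.
DOC = """<test>
<string>A dog</string>
<string>a cat</string>
<string>&amp;#65; frog</string>
</test>"""


def matching_strings(pattern, decode=False):
    """Return the text of every <string> element whose text matches pattern.

    With decode=True the pattern is first run through html.unescape, which
    stands in for the reference decoding an XML/XQuery parser would do.
    """
    if decode:
        pattern = html.unescape(pattern)
    root = ET.fromstring(DOC)
    return [s.text for s in root.iter("string") if re.search(pattern, s.text)]


raw_hits = matching_strings("&#65;")                   # pattern used verbatim
decoded_hits = matching_strings("&#65;", decode=True)  # pattern decoded first
print(raw_hits)
print(decoded_hits)
```

Used verbatim, the pattern only finds the string that literally contains the five characters `&#65;`; decoded first, it finds "A dog", which is the result Saxon gives for the XQuery above.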
george
Site Admin
Posts: 2095
Joined: Thu Jan 09, 2003 2:58 pm

Post by george »

Hi David,

The XPath entry box is a plain input field, and we take its content as-is as the XPath expression to be executed. That is, we do not treat it as an XML fragment, so no decoding of entities is performed.
Does this cause problems for you?

Best Regards,
George

XPath syntax

Post by dsewell »

Well, I think that technically, according to the XPath specification (even for XPath 1.0), a string is a sequence of characters as defined in the XML specification (see http://www.w3.org/TR/xpath#strings), so an XPath parser should in fact treat these as identical:

Code:

contains("CAT", "A")   =  contains("CAT", "&#65;")
contains("dog's", "'") = contains("dog's", "&apos;")
So the current oXygen XPath search is not fully implementing the XPath string model, as I understand it.

For my personal work it's not a big issue because I can almost always directly input a UTF-8 character. But I discovered this bug when I was documenting a procedure for general use. Specifically, I was sharing a method in oXygen for searching for the Unicode em dash (—, U+2014). It is preferable to use a numeric character reference like "contains($string, '&#8212;')" because it is too easy to confuse the dash character with a hyphen. So I do think it would be worth adding support for character entity references in the search field.
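
To illustrate why a numeric reference is less error-prone than a pasted character, here is a small Python sketch (Python is used only for illustration, and the variable names are made up) that prints the decimal reference, code point, and official Unicode name of three look-alike characters.

```python
import unicodedata

# Three characters that are easy to confuse on screen.
lookalikes = ["\u002D", "\u2013", "\u2014"]

# Map each character to (decimal code point, official Unicode name).
info = {ch: (ord(ch), unicodedata.name(ch)) for ch in lookalikes}

for ch, (codepoint, name) in info.items():
    print(f"&#{codepoint}; U+{codepoint:04X} {name}")
```

The reference `&#8212;` resolves to EM DASH, while EN DASH is `&#8211;` and the keyboard hyphen is `&#45;`, so the number removes any doubt about which character is being searched for.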

Post by george »

Hi David,

Not really. In general the XPath expression is placed in an attribute value, and it is the XML parser that decodes the entity. But we will consider anyway adding an option that, when enabled, will decode the standard XML entities &lt;, &gt;, &apos;, &quot; and &amp; and the character entities from the XPath entry field.

Best Regards,
George
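
An option like the one described could work roughly as follows. This is only a sketch in Python under stated assumptions, not oXygen's implementation: the function name `decode_references` is hypothetical, and it handles just the five predefined XML entities plus decimal and hexadecimal character references, leaving anything else untouched.

```python
import re

# The five entities predefined by the XML specification.
PREDEFINED = {"lt": "<", "gt": ">", "apos": "'", "quot": '"', "amp": "&"}

# Matches &name;, &#123; or &#x1F; style references.
REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def decode_references(expression):
    """Decode XML entity and character references in an XPath string."""

    def replace(match):
        body = match.group(1)
        if body.startswith("#x"):
            return chr(int(body[2:], 16))   # hexadecimal reference
        if body.startswith("#"):
            return chr(int(body[1:]))       # decimal reference
        # Unknown named entities are left exactly as typed.
        return PREDEFINED.get(body, match.group(0))

    # A single pass, so "&amp;#65;" becomes "&#65;" and is not decoded again.
    return REFERENCE.sub(replace, expression)


decoded = decode_references('//text()[matches(., "&#65;")]')
print(decoded)
```

Decoding in a single pass keeps an escape hatch: a user who really wants to search for the literal text `&#65;` can type `&amp;#65;` when the option is enabled.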

Post by dsewell »

george wrote: we will consider anyway adding an option that, when enabled, will decode the standard XML entities &lt;, &gt;, &apos;, &quot; and &amp; and the character entities from the XPath entry field.
If that's not too much trouble, I think it would be a helpful feature,

David