
Re: [xsl] Saxon and ZWNJ

Subject: Re: [xsl] Saxon and ZWNJ
From: Mohsen Saboorian <mohsens@xxxxxxxxx>
Date: Mon, 10 Jun 2013 13:03:29 +0430

Sorry, this turned out to be caused by my underlying HTML cleaner
engine (which converts HTML into a valid DOM 3 document). The escaping
issue appeared after I upgraded from htmlcleaner-2.2 to
htmlcleaner-2.5; downgrading resolved it.


On Mon, Jun 10, 2013 at 11:28 AM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
> Yes, I think it's a bug -- but not in Saxon.
> Saxon's implementation of XdmItem.getStringValue() relies on calling textNode.getNodeValue() in the underlying DOM, and my suspicion is that this method is returning the value of the text node in escaped form.
> What exactly is this "HTML cleaned DOM" that you are passing to the DOMSource constructor? If my suspicion is correct, it doesn't implement the DOM spec correctly.
> Michael Kay
> Saxonica
> PS: this question is very product-specific. Product-specific questions are better addressed to a product-specific forum than to the xsl-list. For Saxon, you can use the forums at saxonica.plan.io
> On 9 Jun 2013, at 22:42, Mohsen Saboorian wrote:
>> Hi,
>> I'm trying to evaluate an XPATH expression with saxon- using
>> the following code snippet:
>>  Configuration conf = new Configuration();
>>  conf.setValidation(false);
>>  Processor p = new Processor(false);
>>  DocumentBuilder documentBuilder = p.newDocumentBuilder();
>>  XPathCompiler xpathCompiler = p.newXPathCompiler();
>>  XPathExecutable xpe = xpathCompiler.compile(expression);
>>  XPathSelector xpath = xpe.load();
>>  xpath.setContextItem(documentBuilder.build(new DOMSource(cleanHtml.document)));
>>  XdmItem result = xpath.evaluateSingle();
>> The HTML is in Persian script and contains unescaped ZWNJ (U+200C)
>> characters; its cleaned DOM is passed as cleanHtml.document in the
>> code above.
>> The matched XdmItem contains the ZWNJ unescaped, but
>> result.getStringValue() returns it escaped as &zwnj;, which doesn't
>> seem correct because I'm asking for the node's string value.
>> Is this a bug, or is there a flag to disable escaping of special
>> Unicode characters in Saxon?
>> Regards,
>> Mohsen
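
The diagnosis in this thread (a DOM whose text nodes return already-escaped values from getNodeValue()) can be checked without involving Saxon at all. The following is only a minimal sketch: the class ZwnjDomCheck and its helper methods are hypothetical names, and the JDK's built-in parser stands in for a conforming DOM. To test a third-party DOM, such as the one produced by an HTML cleaner, pass that Document to rootText() instead of the parsed sample.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Saxon or HtmlCleaner.
final class ZwnjDomCheck {

    // Parse a small XML string with the JDK's built-in parser (the reference DOM here).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Concatenate getNodeValue() of the text nodes directly under the root element.
    static String rootText(Document doc) {
        StringBuilder sb = new StringBuilder();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString();
    }

    // True if the value carries ZWNJ as markup (entity or character reference)
    // rather than as the raw character U+200C.
    static boolean looksEscaped(String value) {
        String v = value.toLowerCase(Locale.ROOT);
        return v.contains("&zwnj;") || v.contains("&#8204;") || v.contains("&#x200c;");
    }

    public static void main(String[] args) {
        // The character reference in the source should be decoded by the parser.
        String value = rootText(parse("<p>a&#x200C;b</p>"));
        System.out.println("contains raw U+200C: " + (value.indexOf('\u200C') >= 0));
        System.out.println("looks escaped: " + looksEscaped(value));
    }
}
```

If looksEscaped() reports true for text taken from the DOM that is handed to DOMSource, the escaping is happening before Saxon ever sees the document, which matches what the downgrade to htmlcleaner-2.2 suggested.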
