
Re: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16


Subject: Re: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Fri, 13 Oct 2006 02:28:45 +0200

Hi Michael,

Once more, many thanks for a quick and thorough reply. You must have 48 hours in a day ;-)

Please see my additions below.

Michael Kay wrote:
Actually, that's up to you: in Saxon it's configurable whether this error is
recovered or not.

Thanks, I'll look it up, it may come in very handy. I wonder though what the result will be when calling it on a node set, like this:
<xsl:copy-of select="document($configuration//resource/@url)" />


when some of the @url are not pointing to valid resources.
You're being a bit too concise here, it's not clear to me what's going on.

I see. Actually, I was trying to keep my story to a readable size. But to get to the actual problem, I think this describes it more clearly (a test file I used to find out what was going on):


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:local"
    xml:base="encoding/"
    exclude-result-prefixes="#all">

  <xsl:output indent="yes" />

  <xsl:template match="/">
    <xsl:copy-of select="local:getfile('testUTF8.xml', ())" />
    <xsl:copy-of select="local:getfile('testUTF16.xml', ())" />
    <xsl:copy-of select="local:getfile('testUTF8-with-16-in-prolog.xml', ())" />
    <xsl:copy-of select="local:getfile('testUTF16-with-8-in-prolog.xml', ())" />
    <xsl:copy-of select="local:getfile('testUTF8.xml', 'utf-16')" />
    <xsl:copy-of select="local:getfile('testUTF16.xml', 'utf-8')" />
    <xsl:copy-of select="local:getfile('testUTF8-with-16-in-prolog.xml', 'utf-8')" />
    <xsl:copy-of select="local:getfile('testUTF16-with-8-in-prolog.xml', 'utf-16')" />
    <xsl:value-of select="unparsed-text('testUTF8.xml')" />
    <xsl:value-of select="unparsed-text('testUTF16.xml')" />
    <xsl:copy-of select="document('testUTF8-with-16-in-prolog.xml')" />
    <xsl:copy-of select="document('testUTF16-with-8-in-prolog.xml')" />
  </xsl:template>

  <xsl:function name="local:getfile">
    <xsl:param name="filename" />
    <xsl:param name="encoding" />
    <xsl:variable name="fileinfo">
      <xsl:variable name="unp-available" as="xs:boolean"
        select="if (empty($encoding))
                then unparsed-text-available($filename)
                else unparsed-text-available($filename, $encoding)" />
      <available><xsl:value-of select="$unp-available" /></available>
      <content>
        <xsl:if test="$unp-available">
          <xsl:value-of select="
            if (empty($encoding))
            then normalize-space(unparsed-text($filename))
            else normalize-space(unparsed-text($filename, $encoding))" />
        </xsl:if>
      </content>
    </xsl:variable>
    <file-info
      req-enc="{$encoding}"
      doc-avail="{doc-available($filename)}"
      unp-text-avail="{$fileinfo/available}"
      filename="{$filename}">
      <xsl:text>&#x0A;</xsl:text>
      <unparsed-content>
        <xsl:value-of select="$fileinfo/content"/>
      </unparsed-content>
      <xsl:text>&#x0A;</xsl:text>
    </file-info>
  </xsl:function>
</xsl:stylesheet>



In fact, this runs a series of tests. The last four value-of/copy-of calls may fail and throw an error. The first eight (with getfile) should never throw an error.


This is what actually happened this morning: one of our programmers was testing and suddenly the export part of the system produced nothing, or inconclusive results. After quite a while, we found out that the XML serialization method we used had changed. We had switched to DOM Level 3, using the latest version of Xerces, to make use of the new Load/Save additions (which in turn was done to get rid of the rather ridiculous way of dealing with namespace serialization when we serialized the old way).

The output file looked perfect and as such had a header not unfamiliar to people with some XML knowledge. The content was perfectly readable, too:
<?xml version="1.0" encoding="UTF-16"?>


BUT! (after more research) The application failed because this file was actually serialized to disk as UTF-8. So the XML prolog and the actual encoding did not match.
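The symptom is easy to reproduce outside of any XML tooling. A minimal Python sketch (my own illustration; the document content here is made up): decoding UTF-8 bytes as if they were UTF-16 does not fail, it silently produces readable-looking mojibake.

```python
# Illustration only: decoding UTF-8 bytes as UTF-16 silently produces
# mojibake instead of an error (the document content here is made up).
text = '<?xml version="1.0" encoding="UTF-16"?><root/>'
data = text.encode("utf-8")        # what actually ended up on disk

# Each pair of single bytes is read as one 16-bit code unit, so the
# result is CJK-looking garbage, not a decoding failure (as long as
# the byte count happens to be even).
garbled = data.decode("utf-16-le")
print(repr(garbled))
```

This is why the export "produced nothing, or inconclusive results" rather than stopping with a clean error.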

Because we already used unparsed-text(), I was very surprised to find out that that particular function tried to read the file as XML. Reading the specs, that is indeed required, but it yielded some unexpected results.

If you would like to run the tests yourself, it is easy enough to create the malformed XML files. Just create them the normal way, open them in a Unicode-aware editor (NOT an XML-aware editor!) and save as UTF-8 when UTF-16 is in the prolog, and vice versa (or I can send the malformed set and/or place it online).
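Alternatively, the malformed files can be generated programmatically. A small Python sketch (my own, with hypothetical output paths) that writes the two mismatched files used in the stylesheet above:

```python
import os
import tempfile

def write_mismatched(path, claimed, actual):
    """Write an XML file whose prolog claims encoding `claimed`
    but whose bytes are really encoded as `actual`."""
    text = '<?xml version="1.0" encoding="%s"?>\n<test/>\n' % claimed
    data = text.encode(actual)
    with open(path, "wb") as f:      # binary mode: no re-encoding
        f.write(data)
    return data

tmp = tempfile.mkdtemp()
# UTF-8 bytes with a prolog claiming UTF-16, and the reverse.
# "utf-16-le" is chosen deliberately: Python's plain "utf-16" codec
# would prepend a BOM, and the files in question had none.
utf8_bytes = write_mismatched(
    os.path.join(tmp, "testUTF8-with-16-in-prolog.xml"), "UTF-16", "utf-8")
utf16_bytes = write_mismatched(
    os.path.join(tmp, "testUTF16-with-8-in-prolog.xml"), "UTF-8", "utf-16-le")
```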

I think you're trying to abbreviate the text, which is reasonable, but in
doing so you've misrepresented it.

Well, actually, I wanted to simplify the discussion. But reading on in your comments, I understand that that's not feasible in this context.


Rule 1 (which is actually rule 2 in the
spec) does not say "if you can read it as XML", it says "if the media type
of the resource is text/xml or application/xml (see [RFC2376]), or if it
matches the conventions text/*+xml or application/*+xml (see [RFC3023]
and/or its successors)". So if an HTTP server serves up a non-XML document
with an application/xml media type, this rule is going to kick in.

Well, no HTTP server here. Just a text file read from disk. But I guess the media types as registered by the operating system also count?


My question: why is the file read in the encoding specified in the
(illegal) XML file?

If you're asking why the spec is as it is, the answer is (a) for
compatibility with XInclude, and (b) for use cases where you want to use
unparsed-text() to read XML/XHTML without parsing it.

No, I am basically trying to find out what to expect in this odd circumstance (and since users may add their own files, this scenario may very well appear again).


Firstly, there's been one late change to the spec in this area. It's now
more permissive, it allows the processor to try harder.

I didn't get that from the spec. But then, I still haven't read every corner of it ;-)


4. [new] the processor may use implementation-defined heuristics to
determine the encoding, otherwise

I looked here: http://www.w3.org/TR/xslt20/#unparsed-text and it's not yet added. So, this is *very* hot off the press?


It's true that Saxon doesn't currently implement this quite as written.

Lucky, lucky me in this scenario, but I am afraid you will change that later, so I can't (and should not) rely on that.


The
sequence currently followed by Saxon is in essence:

1. the value of the $encoding argument is used if present, otherwise
Checked

2. if the file is being read using the HTTP protocol, get the encoding
from the HTTP headers, otherwise
Can't verify. I use a URI (as in the earlier discussion) with 'file://' etc., so no HTTP here.

3. read the beginning of the file:

3a: if there's a UTF-16 or UTF-8 byte order mark, assume it's correct,
otherwise
Checked: no BOM for either corrupted file (they start with 3C 3F and 00 3C 00 3F respectively, the standard start of '<?' in either encoding).

3b: if there's something that looks like an [ASCII] XML declaration
with an encoding attribute, use that
This is where things go wrong, I think. It appears that Saxon indeed finds the XML declaration in either file, and uses it. To my surprise, it does not check the result, which is illegal XML: it tries to read a UTF-8 encoded file as UTF-16 because the (UTF-8/ASCII) XML declaration says UTF-16. That cannot be correct: if you can read the prolog as UTF-8, the file cannot be UTF-16; the two exclude each other. So 3b is only partially feasible, I think.

3c: if the first four even-numbered bytes are zero, assume UTF-16BE
That's my case in the other scenario, where a UTF-16 file has an XML declaration claiming UTF-8. Again (see above) this poses a contradiction: the XML declaration cannot be readable as UTF-16, claim (in UTF-16) to be UTF-8, and then suddenly have the file be UTF-8.

3d: if the first four odd-numbered bytes are zero, assume UTF-16LE

4. otherwise assume UTF-8.
It never gets here if the XML declaration is malformed.
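The sequence above can be sketched in Python as follows. This is my own rough illustration (not Saxon's actual code), assuming the HTTP case (step 2) does not apply; the test at the end shows how step 3b wins for the mismatched file, exactly as described:

```python
import re

def sniff_encoding(data, encoding_arg=None):
    """Rough sketch of the decision sequence above; not Saxon's code.
    Assumes step 2 (HTTP headers) does not apply."""
    # 1. explicit $encoding argument wins
    if encoding_arg:
        return encoding_arg
    # 3a. byte order marks
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # 3b. an ASCII-readable XML declaration with an encoding pseudo-attribute
    m = re.match(rb'<\?xml[^?]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', data)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3c/3d. zero bytes at even/odd positions suggest BOM-less UTF-16
    if data[0:4:2] == b"\x00\x00":
        return "utf-16-be"
    if data[1:4:2] == b"\x00\x00":
        return "utf-16-le"
    # 4. otherwise assume UTF-8
    return "utf-8"

# The problem case: UTF-8 bytes whose declaration claims UTF-16.
# Step 3b fires before the sanity of the claim is ever checked.
print(sniff_encoding(b'<?xml version="1.0" encoding="UTF-16"?><root/>'))
```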

I don't really understand all the background here, but it's all to do with
browser history: popular browsers try to outguess the HTTP headers, and the
specs disapprove, and W3C is trying to hold its ground in the battle. It's a
bigger issue than XSLT, in other words.

Sounds like a lot of politics to me. Never knew that such a tiny thing could come out of such a big battle ;-)



Thanks and cheers,


-- Abel Braaksma
  http://www.nuntia.com

