Re: [xsl] 16-bit chars rendered as "?" in UTF-8?


Subject: Re: [xsl] 16-bit chars rendered as "?" in UTF-8?
From: David Carlisle <davidc@xxxxxxxxx>
Date: Tue, 14 Aug 2012 12:42:30 +0100

On 14/08/2012 12:24, John English wrote:
On 13/08/2012 14:19, David Carlisle wrote:
Most likely reason is that either your input document or your
result document are being served with the wrong encoding. (ie the
encoding in the http header does not match the encoding in the
file)

Many thanks for this tip. The input was indeed ISO-8859-1 while the output was UTF-8. Changing the input encoding to UTF-8 fixed the problem. However, I still don't quite understand why this caused a problem, and if you have the time I'd be grateful for a brief explanation suited to a bear of very little brain...


If the input is everywhere using numeric character references &#nnnn; then the input document could be labeled with any ASCII-compatible encoding (ascii, iso-8859-x, utf-8) and it would make no difference.
So at some point in the pipeline before it reaches the XSLT stylesheet, I think that character data has been entered directly or (equivalently) the file has been through an XML parser that has expanded the references.
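
As a rough illustration of that first point, here is a minimal Python sketch (Python and its bundled expat-based parser are my choice for the example, not anything from the original poster's pipeline). It parses the same ASCII-only bytes under three different encoding labels; the names TEMPLATE and parse_with_label are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# An ASCII-only document: every non-ASCII character is written as a
# numeric character reference (&#233; is U+00E9, &#8364; is U+20AC).
# TEMPLATE and parse_with_label are hypothetical names for this sketch.
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><p>&#233;&#8364;</p>'


def parse_with_label(enc: str) -> str:
    """Parse the same ASCII bytes, labelled with the given encoding."""
    data = TEMPLATE.format(enc=enc).encode("ascii")
    return ET.fromstring(data).text


# The label changes, the bytes of the content do not, and neither does
# the parsed result.
results = {enc: parse_with_label(enc)
           for enc in ("US-ASCII", "ISO-8859-1", "UTF-8")}

for enc, text in results.items():
    print(enc, [hex(ord(ch)) for ch in text])
```

All three labels should give the same two code points, because the references name code points directly and never pass through the declared encoding.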



A single piece of code loads a single stylesheet which is used to transform the input. In both cases the input was encoded as ISO-8859-1 using entities "&#nnnn;" to represent the 16-bit characters using 8-bit characters only.

Your use of "16-bit characters" and "8-bit characters" here is, I think, the cause of the misunderstanding. There is no such distinction.


Numeric references always refer to Unicode code points (21 bits actually, going up to hex 10FFFF). The encoding specified in the HTTP headers (and/or the <?xml encoding=... declaration in the input file)
refers to the encoding used for character data. Most (but not all) encodings encode ASCII characters, that is the _7_ bit range, in the same way, although XML does not require that: UTF-16, for example, takes two bytes to encode ASCII characters, and EBCDIC encodings use one byte but different values.
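
To make the ASCII point concrete, here is a small Python sketch (again only an illustration; the codec names are Python's, and cp037 is picked as one representative EBCDIC code page). It shows the bytes each encoding uses for the letter "A"; hex_bytes is a hypothetical helper.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded form of text as a hex string."""
    return text.encode(encoding).hex()


# Python codec names; cp037 is one EBCDIC code page, chosen as an example.
encodings = ["ascii", "iso-8859-1", "utf-8", "utf-16-be", "cp037"]

# How the single ASCII character "A" (U+0041) is stored in each encoding.
table = {enc: hex_bytes("A", enc) for enc in encodings}

for enc, value in table.items():
    print(f"{enc:12} {value}")
```

The first three agree on the single byte 41, UTF-16 needs two bytes, and the EBCDIC code page uses one byte with a different value.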


If you take an "8 bit" character such as e-acute, U+00E9 (decimal 233),
then in iso-8859-1 that is encoded as a single byte with value 233, but in utf-8 it is encoded as two bytes with hex values C3 A9. So if the input parser is expecting to see a UTF-8 stream and it sees the byte E9 on its own, then that is a syntax error. It probably ought to stop at that point with a fatal parse error, but you seem to be using a system that recovers by replacing the undecodable byte by a ?
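
The mismatch can be reproduced in a few lines of Python (a sketch only; the original system is unknown, and Python's lenient decoder substitutes U+FFFD rather than a literal "?", though many systems display or re-encode that as "?"). The helper names here are invented for the example.

```python
import xml.etree.ElementTree as ET

E_ACUTE = "\u00e9"

latin1_bytes = E_ACUTE.encode("iso-8859-1")  # one byte: E9
utf8_bytes = E_ACUTE.encode("utf-8")         # two bytes: C3 A9


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes."""
    return data.decode("utf-8", errors="replace")


def parses_as(label: str, content: bytes) -> bool:
    """Return True if content parses as XML under the given encoding label."""
    doc = (b'<?xml version="1.0" encoding="' + label.encode("ascii")
           + b'"?><p>' + content + b"</p>")
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# ISO-8859-1 bytes read as UTF-8: a lenient decoder loses the character.
recovered = decode_lenient(b"caf" + latin1_bytes)

# A conforming XML parser rejects the mislabelled bytes outright.
ok_when_labelled_correctly = parses_as("ISO-8859-1", latin1_bytes)
ok_when_mislabelled = parses_as("UTF-8", latin1_bytes)

print(latin1_bytes.hex(), utf8_bytes.hex())
print(repr(recovered), ok_when_labelled_correctly, ok_when_mislabelled)
```

The same byte E9 is fine when the document is labelled ISO-8859-1 and fatal (or silently replaced, depending on the system) when it is labelled UTF-8.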



In both cases the output is
UTF-8 (as defined in the stylesheet) but in one case the entities are
transformed into the corresponding 16-bit characters "W" and so on,
while in the other case they are transformed into question marks "?",
character 0x3F. What I don't understand is why this should happen
when both cases are dealt with by the same code and stylesheet?

Again, many thanks,


David

--
google plus: https:/profiles.google.com/d.p.carlisle
