
Re: [xsl] Trouble with special characters

Subject: Re: [xsl] Trouble with special characters
From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 25 Jan 2016 22:45:34 -0000

If you are working in Java, be sure that anywhere you go from bytes to
characters you specify the encoding explicitly, and that if you generate
XML with an encoding declaration, the declaration matches the encoding
you're actually writing.
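
A minimal sketch of that advice, assuming an ISO-8859-1 target. The class and method names are hypothetical, not from the thread. One Charset object drives both the encoding declaration and the Writer, so the two cannot drift apart, and the read side names its charset as well.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class ExplicitEncodingDemo {

    // Serializes a tiny XML document. The declaration and the bytes both
    // come from the same Charset, so they always agree.
    // Assumes 'text' contains no characters that need XML escaping.
    static byte[] writeXml(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n");
            w.write("<temp>" + text + "</temp>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Bytes to characters: always name the charset.
    static String readXml(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // \u00B0 is the degree sign, written as an escape so the source
        // file's own encoding cannot interfere.
        byte[] bytes = writeXml("20\u00B0C", StandardCharsets.ISO_8859_1);
        String roundTrip = readXml(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(roundTrip.contains("20\u00B0C")); // true
    }
}
```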


Eliot Kimber, Owner
Contrext, LLC

On 1/25/16, 2:48 PM, "a kusa akusa8@xxxxxxxxx"
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>Thanks a lot for taking the time to explain this issue in detail. So I
>will go back and try to debug the java code and see if the encoding is
>set correctly here.
>On Mon, Jan 25, 2016 at 1:35 PM, Eliot Kimber ekimber@xxxxxxxxxxxx
><xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>> For a situation like this you have to look closely at the chain of custody
>> of the data as it comes in and out of different tools--any component that
>> touches it has the opportunity to mess things up.
>> As others have pointed out, if the data coming in is correct then the data
>> going out as produced directly by Saxon should be correct as well. That
>> is, the mapping from Unicode characters to ISO-8859-1 should be handled
>> correctly by the serializer Saxon is using.
>> The "gibberish" you're showing is the three bytes of the UTF-8 encoded
>> "REPLACEMENT CHARACTER" interpreted as individual Unicode characters. The
>> UTF-8 encoding of this character, Unicode code point FFFD, is 0xEF 0xBF
>> 0xBD. Character 0xEF (239) is i-umlaut in ISO-8859-1, 0xBF (191) is
>> inverted question mark, and 0xBD (189) is the 1/2 fraction. Thus your
>> gibberish.
>> (http://www.fileformat.info/info/unicode/char/0fffd/index.htm)
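
The byte arithmetic above can be checked with a few lines of Java. This is only a sketch; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class ReplacementCharDemo {

    // UTF-8 bytes of U+FFFD (REPLACEMENT CHARACTER) as a hex string.
    static String utf8Hex() {
        StringBuilder sb = new StringBuilder();
        for (byte b : "\uFFFD".getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // The same three bytes misread as ISO-8859-1: one character per byte.
    static String misread() {
        byte[] utf8 = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex()); // EF BF BD
        // Print code point and Unicode name of each misread character.
        for (char c : misread().toCharArray()) {
            System.out.printf("U+%04X %s%n", (int) c, Character.getName(c));
        }
    }
}
```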
>> So the following is happening somewhere in your tool chain:
>> 1. Something is not recognizing the character you think should be a degree
>> symbol as a known Unicode character and is replacing it with the Unicode
>> replacement character.
>> 2. Something is then reading the bytes resulting from (1) as ISO-8859-1
>> (or a similar single-byte encoding) rather than UTF-8 and treating each
>> byte of the replacement character sequence as an individual character.
>> 3. The remaining stages don't know any better and continue to treat the
>> characters as characters, resulting in the three characters i-umlaut,
>> inverted question mark, 1/2 fraction in the output.
>> I think the most likely thing is that something is reading the incoming
>> ISO-8859-1 data as Unicode, not recognizing the byte 0xB0 (degree symbol
>> in ISO-8859-1) as a Unicode character (because that byte on its own is
>> not valid in any Unicode-defined encoding), and replacing it with the
>> Unicode replacement character.
>> Something then reads this byte sequence as ISO-8859-1, not UTF-8, but then
>> generates UTF-8 output (otherwise the byte sequence would be the same on
>> input and output), resulting in the gibberish.
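
The suspected chain can be reproduced in isolation. The sketch below assumes the input byte is 0xB0 (the ISO-8859-1 degree sign) and that the misreading stages use UTF-8 and ISO-8859-1 as described; all names are hypothetical, and it says nothing about which component in the real tool chain is at fault.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction of the three-stage corruption.
class MojibakeChainDemo {

    // Stage 1: a lone 0xB0 byte is not a valid UTF-8 sequence, so decoding
    // it as UTF-8 substitutes U+FFFD.
    static String decodeLatin1ByteAsUtf8() {
        byte[] latin1Degree = { (byte) 0xB0 };
        return new String(latin1Degree, StandardCharsets.UTF_8);
    }

    // Stage 2: the string is written as UTF-8 and then read back as
    // ISO-8859-1, turning each byte into its own character.
    static String reencodeAndMisread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String step1 = decodeLatin1ByteAsUtf8();
        String garbled = reencodeAndMisread(step1);
        System.out.println(step1.equals("\uFFFD")); // true
        System.out.println(garbled.length());       // 3
        // Stage 3: writing the garbled text as UTF-8 now takes 6 bytes,
        // so output no longer matches input byte for byte.
        System.out.println(garbled.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}
```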
>> Some tools write XML in one encoding but put in a different encoding
>> declaration, e.g., a file is written as ISO-8859-1 but with a UTF-8
>> declaration. This would lead to the behavior we're seeing here, where the
>> degree symbol should be encoded as two UTF-8 bytes but is output as a
>> single ISO-8859-1 byte.
>> Using Java it's easy to forget to specify the encoding when writing a
>> sequence using a Writer or when constructing a String instance. This will
>> result in the bytes being written in the default encoding for the system
>> running the application, which is almost always *not* a Unicode
>> encoding. Other languages have similar pitfalls.
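
A short sketch of the pitfall, with hypothetical names. The no-argument getBytes() call uses the platform default, which depends on the JVM and operating system (Java 18 and later default to UTF-8, while older JVMs follow the OS locale), so its result is deliberately not annotated here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class DefaultCharsetDemo {

    // Number of bytes a string occupies in a given, explicit charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String degree = "\u00B0"; // degree sign

        // What this JVM would silently use if no charset were given.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Platform-dependent: could be 1 or 2 bytes depending on the default.
        System.out.println(degree.getBytes().length);

        // Explicit charsets give the same answer on every machine.
        System.out.println(encodedLength(degree, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength(degree, StandardCharsets.UTF_8));      // 2
    }
}
```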
>> I find the free Windows tool Unipad to be invaluable when trying to track
>> down this type of encoding problem--it does a good job of guessing the
>> real encoding and also has tools for converting between many encodings,
>> inspecting files in uncommon encodings, and so on. However, oXygenXML has
>> a lot of good tools for this now, so I depend on Unipad less than I used
>> to 10 years ago. (http://www.unipad.org/main/)
>> Good luck.
>> Cheers,
>> Eliot
>> ----
>> Eliot Kimber, Owner
>> Contrext, LLC
>> http://contrext.com
>> On 1/25/16, 12:36 PM, "a kusa akusa8@xxxxxxxxx"
>> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>The transformed XML itself has the gibberish value for the degree
>>>symbol. So it displays as question marks in IE.
>>>There is a java program that uses the transformation factory to
>>>convert the XML. I view the results in XML Spy.
>>>On Mon, Jan 25, 2016 at 12:17 PM, Martin Honnen martin.honnen@xxxxxx
>>><xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> a kusa akusa8@xxxxxxxxx wrote:
>>>>> And you have <xsl:output omit-xml-declaration="no"/> as well? Does the
>>>>> result have an XML declaration? -Yes, there is an XML declaration.
>>>>> Does XML Spy indicate the encoding used to display the file?- Not sure
>>>>> where to see this. The transformed XML has the encoding set to
>>>>> ISO-8859-1.
>>>> What happens when you load the XML result into a browser like IE or
>>>> Are the characters displayed as you want them?
>>>> As for using Saxon, how do you use it, do you run it from the command
>>>> line yourself, with -o:result.xml output option? Or is XML Spy running
>>>> it, maybe not doing it right?
