Re: [xsl] Trouble with special characters


Subject: Re: [xsl] Trouble with special characters
From: "a kusa akusa8@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 25 Jan 2016 20:48:51 -0000

Thanks a lot for taking the time to explain this issue in detail. I
will go back and debug the Java code to see whether the encoding is
set correctly there.
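
As a starting point for that debugging, here is a minimal sketch of what I
plan to check. It assumes the program uses the standard JAXP API; the class
and method names (TransformToBytes, identityTransform) are made up for
illustration and are not from the real code. The idea is to let the
serializer write bytes to an OutputStream, so that the encoding named in the
XML declaration and the encoding of the actual bytes come from the same
place.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not the real application code.
public class TransformToBytes {
    // Runs an identity transform and serializes the result as bytes in the
    // requested encoding.
    static byte[] identityTransform(String xml, String encoding) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // The serializer both declares this encoding and encodes with it.
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // Passing an OutputStream (not a Writer) leaves the encoding
            // decision to the serializer.
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] result = identityTransform("<t>25\u00B0C</t>", "ISO-8859-1");
        System.out.println(result.length + " bytes written");
    }
}
```

If the real code wraps the result in a Writer instead, the Writer's own
encoding is what ends up in the file, regardless of the declaration.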



On Mon, Jan 25, 2016 at 1:35 PM, Eliot Kimber ekimber@xxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> For a situation like this you have to look closely at the chain of custody
> of the data as it comes in and out of different tools--any component that
> touches it has the opportunity to mess things up.
>
> As others have pointed out, if the data coming in is correct then the data
> going out as produced directly by Saxon should be correct as well. That
> is, the mapping from Unicode characters to ISO-8859-1 should be handled
> correctly by the serializer Saxon is using.
>
> The "gibberish" you're showing is the three bytes of the UTF-8 encoded
> "REPLACEMENT CHARACTER", each interpreted as an individual character. The
> UTF-8 encoding of this character, Unicode code point U+FFFD, is 0xEF 0xBF
> 0xBD. In ISO-8859-1, 0xEF (239) is i-umlaut, 0xBF (191) is inverted
> question mark, and 0xBD (189) is the 1/2 fraction. Thus your gibberish.
> (http://www.fileformat.info/info/unicode/char/0fffd/index.htm)
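
To convince myself of the byte arithmetic above, I put together a small Java
sketch. The class and method names are made up; it only uses the standard
String and StandardCharsets APIs. It encodes U+FFFD as UTF-8 and then
deliberately decodes those bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
public class ReplacementCharDemo {
    // Encode U+FFFD as UTF-8, then misread the bytes as ISO-8859-1.
    static String misreadReplacementChar() {
        byte[] utf8 = "\uFFFD".getBytes(StandardCharsets.UTF_8); // EF BF BD
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Prints U+00EF, U+00BF, U+00BD: i-umlaut, inverted question
        // mark, 1/2 fraction.
        for (char c : misreadReplacementChar().toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```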
>
> So the following is happening somewhere in your tool chain:
>
> 1. Something is not recognizing the character you think should be a degree
> symbol as a valid character and is replacing it with the Unicode
> replacement character (U+FFFD).
>
> 2. Something is then reading the bytes resulting from (1) as a single-byte
> encoding such as ISO-8859-1 rather than UTF-8 and treating each byte of the
> replacement character sequence as an individual character.
>
> 3. The remaining stages don't know any better and continue to treat the
> characters as characters, resulting in the three characters i-umlaut,
> inverted question mark, 1/2 fraction in the output.
>
> I think the most likely thing is that something is reading the incoming
> ISO-8859-1 data as UTF-8, not recognizing the lone byte 0xB0 (the degree
> symbol in ISO-8859-1) as a valid UTF-8 sequence, and replacing it with the
> Unicode replacement character.
>
> Something then reads this byte sequence as a single-byte encoding, not
> UTF-8, but then generates UTF-8 output (otherwise the byte sequence would
> be the same on input and output), resulting in the gibberish.
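
Here is a simplified simulation of that chain in Java, as I understand it.
It is only a sketch: the class name is made up, and the real tool chain may
have more or different stages. It relies on the documented behaviour that
new String(bytes, charset) replaces malformed input with the replacement
character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of the suspected chain, not the real tools.
public class MojibakeChain {
    static String simulate() {
        // The degree sign stored as a single ISO-8859-1 byte, 0xB0.
        byte[] latin1 = "\u00B0".getBytes(StandardCharsets.ISO_8859_1);
        // Stage 1: the byte is decoded as UTF-8. A lone 0xB0 is malformed,
        // so the decoder substitutes U+FFFD.
        String afterStage1 = new String(latin1, StandardCharsets.UTF_8);
        // Stage 2: the string is written out as UTF-8 (EF BF BD) ...
        byte[] utf8 = afterStage1.getBytes(StandardCharsets.UTF_8);
        // ... and read back as ISO-8859-1, one character per byte.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Three characters: U+00EF, U+00BF, U+00BD.
        System.out.println(simulate().length());
    }
}
```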
>
> Some tools write XML in one encoding but put in a different encoding
> declaration, e.g., a file is written as ISO-8859-1 but with a UTF-8
> encoding declaration. This would lead to the behavior we're seeing here,
> where the degree symbol should be encoded as two UTF-8 bytes (0xC2 0xB0)
> but is output as the single ISO-8859-1 byte 0xB0.
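
One way I could check a file for this kind of mismatch is to decode its bytes
strictly as UTF-8 and see whether that fails. A minimal sketch, assuming the
file's bytes are already in memory; the class and method names are made up.
It uses CharsetDecoder with CodingErrorAction.REPORT so that bad input raises
an exception instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking bytes against a claimed UTF-8 encoding.
public class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A lone 0xB0 (ISO-8859-1 degree sign) is not valid UTF-8.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xB0}));
        // 0xC2 0xB0 is the UTF-8 encoding of the degree sign.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC2, (byte) 0xB0}));
    }
}
```

A file that passes this check can still be mislabelled, but a failure is a
strong hint that the UTF-8 declaration is wrong.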
>
> Using Java it's easy to forget to specify the encoding when writing a byte
> sequence using a Writer or when constructing a String instance. This will
> result in the bytes being written in the default encoding for the system
> running the application, which is often *not* a Unicode encoding. Other
> languages have similar pitfalls.
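
For my own notes, this is the kind of explicit-encoding code I will look for
(or add) in the Java program. It is a sketch with made-up names, using only
OutputStreamWriter and StandardCharsets from the JDK; the point is that the
charset is always passed in rather than left to the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing explicit charsets on both write and read.
public class ExplicitCharset {
    // Writes text through a Writer whose encoding is stated explicitly.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Builds a String from bytes with the encoding stated explicitly.
    static String read(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String degree = "\u00B0";
        // One byte (B0) in ISO-8859-1, two bytes (C2 B0) in UTF-8.
        System.out.println(write(degree, StandardCharsets.ISO_8859_1).length);
        System.out.println(write(degree, StandardCharsets.UTF_8).length);
    }
}
```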
>
> I find the free Windows tool Unipad to be invaluable when trying to track
> down this type of encoding problem--it does a good job of guessing the
> real encoding and also has tools for converting between many encodings,
> inspecting files in uncommon encodings, and so on. However, oXygenXML has
> a lot of good tools for this now, so I depend on Unipad less than I used
> to 10 years ago. (http://www.unipad.org/main/)
>
> Good luck.
>
> Cheers,
>
> Eliot
>
> ----
> Eliot Kimber, Owner
> Contrext, LLC
> http://contrext.com
>
>
>
>
> On 1/25/16, 12:36 PM, "a kusa akusa8@xxxxxxxxx"
> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>>The transformed XML itself has the gibberish value for the degree
>>symbol. So it displays as question marks in IE.
>>
>>There is a Java program that uses the transformation factory to
>>convert the XML. I view the results in XML Spy.
>>
>>On Mon, Jan 25, 2016 at 12:17 PM, Martin Honnen martin.honnen@xxxxxx
>><xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>> a kusa akusa8@xxxxxxxxx wrote:
>>>>
>>>> And you have <xsl:output omit-xml-declaration="no"/> as well? Does the
>>>> result have an XML declaration? - Yes, there is an XML declaration.
>>>>
>>>> Does XML Spy indicate the encoding used to display the file? - Not sure
>>>> where to see this. The transformed XML has the encoding set to
>>>> ISO-8859-1.
>>>
>>>
>>> What happens when you load the XML result into a browser like IE or
>>> Firefox? Are the characters displayed as you want them?
>>>
>>> As for using Saxon, how do you use it? Do you run it from the command line
>>> yourself, with the -o:result.xml output option? Or is XML Spy running Saxon
>>> and maybe not doing it right?

