Re: [xsl] using xsl:message with UTF-8 characters


Subject: Re: [xsl] using xsl:message with UTF-8 characters
From: "Andrew Welch" <andrew.j.welch@xxxxxxxxx>
Date: Mon, 23 Apr 2007 14:21:04 +0100

On 4/23/07, Abel Braaksma <abel.online@xxxxxxxxx> wrote:
When the
Regional settings are set to US or some Western European country, the
codepage will default to CP1252 (windows-1252) (which is, like I said,
incompatible with the codepage for the console, giving the weird
characters in the U+0127+ range).

In the 8-bit character range there are two blocks, C0 (0x00-0x1F) and
C1 (0x80-0x9F), which contain "control characters": non-printable
characters that were used to control printing equipment, for example
"move print head here" (sorry for the lack of depth here :)

Apparently Microsoft decided to wedge more characters into the 8-bit
range by replacing the control characters in the C1 range (0x80-0x9F)
with more useful characters, which seems fair enough; the C0 range is
left alone.  The catch is that ISO-8859-1 and Unicode keep C1 as
control characters, so the mappings disagree for exactly those bytes.
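
To see which bytes are actually affected, here is a minimal Python
sketch that compares the standard library's "cp1252" and "iso-8859-1"
codecs byte by byte.  The function name cp1252_differences is made up
for illustration, and the result reflects Python's codec tables, where
every differing byte falls in 0x80-0x9F.

```python
def cp1252_differences() -> dict:
    """Map each byte that CP1252 decodes differently from ISO-8859-1
    to the character CP1252 assigns to it."""
    diffs = {}
    for byte in range(256):
        raw = bytes([byte])
        latin1_char = raw.decode("iso-8859-1")
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (0x81, for example) are undefined in
            # Python's cp1252 table, so skip them.
            continue
        if cp1252_char != latin1_char:
            diffs[byte] = cp1252_char
    return diffs


diffs = cp1252_differences()
for byte, char in sorted(diffs.items()):
    print(f"0x{byte:02X} -> U+{ord(char):04X}")
```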

The problem arises when you save a file without being explicit about
the encoding, and then read it back in a different encoding.  This
happens a lot (in Windows) when you save an XML file with a
non-XML-aware editor (say notepad), and then open it in an XML-aware
editor.  The file will be saved in CP1252, with characters like "en
dash" and "em dash" saved as the single bytes #150 and #151 instead of
as encodings of #8211 and #8212 respectively.  So when you open the
file in an XML-aware editor, it reads the XML prolog and decodes the
file as, say, UTF-8, and you get non-printable characters instead of
the dashes... which can be displayed as either a box or a question
mark depending on (...I'm not sure what that depends on actually).
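
A small Python sketch of that mismatch, using only the built-in
codecs.  The sample string is invented for illustration; the point is
what happens to the byte 0x96 (#150) under each decoder.

```python
# Sample text containing EN DASH (U+2013, decimal 8211).
text = "1990\u20132007"

# Saving as CP1252 stores the dash as the single byte 0x96 (decimal 150).
raw = text.encode("cp1252")

# Reading those bytes as ISO-8859-1 yields the C1 control U+0096,
# which has no glyph.
as_latin1 = raw.decode("iso-8859-1")

# Reading them as UTF-8 fails outright, because a lone 0x96 is not a
# valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# A lenient decoder substitutes U+FFFD REPLACEMENT CHARACTER instead.
as_utf8_lenient = raw.decode("utf-8", errors="replace")
```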

To compound the issue, if your XML is specified as ISO-8859-1 in the
prolog, some MS tools will read the characters in the control ranges
and auto-switch the encoding to CP1252, giving the impression
everything is fine.
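
I don't know how the MS tools implement this, but the behaviour can be
approximated with a rough Python sketch.  The helper name
decode_latin1_leniently and the "any byte in 0x80-0x9F" test are my
assumptions, not a description of any real tool.

```python
def decode_latin1_leniently(raw: bytes) -> str:
    """Hypothetical helper: decode bytes declared as ISO-8859-1, but
    quietly switch to CP1252 if any byte falls in 0x80-0x9F."""
    has_c1_bytes = any(0x80 <= b <= 0x9F for b in raw)
    if has_c1_bytes:
        # Assume the file was really written as CP1252.
        return raw.decode("cp1252", errors="replace")
    return raw.decode("iso-8859-1")


# The dash byte 0x96 comes out as a real en dash, hiding the mismatch.
print(decode_latin1_leniently(b"1990\x962007"))
```

A strict ISO-8859-1 reader given the same bytes would produce the
control character U+0096 instead, which is why the file looks fine in
one tool and broken in another.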

The simple rule is, always read and write using the same encoding, and
be aware when something is converting between characters and bytes
behind the scenes - servlets for example.  Make sure the font you're
viewing the result in contains the glyphs for the characters you're
trying to view (helpfully the no-glyph character is often the same box
or question mark used to mean no-mapping in the encoding...requiring a
hex editor to check the underlying bytes), and be certain the viewer
is showing the result in the right encoding (the cmd window here, or
say the Eclipse output window, which is another notorious spot).
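
As a minimal Python sketch of the "same encoding both ways" rule: name
the encoding on both the write and the read, and dump the raw bytes as
hex when the display can't be trusted.  The file name and sample text
are made up for illustration.

```python
import os
import tempfile

text = "en dash \u2013 em dash \u2014"
path = os.path.join(tempfile.mkdtemp(), "dashes.txt")

# Write with an explicit encoding rather than the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# When in doubt, look at the underlying bytes instead of the glyphs.
with open(path, "rb") as f:
    raw = f.read()

dash_bytes = "\u2013".encode("utf-8").hex(" ")
print(raw.hex(" "))
```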


cheers andrew

