Re: [xsl] using xsl:message with UTF-8 characters

Subject: Re: [xsl] using xsl:message with UTF-8 characters
From: "Andrew Welch" <andrew.j.welch@xxxxxxxxx>
Date: Mon, 23 Apr 2007 14:21:04 +0100

On 4/23/07, Abel Braaksma <abel.online@xxxxxxxxx> wrote:

> When the
> Regional settings are set to US or some Western European country, the
> codepage will default to CP1252 (windows-1252) (which is, like I said,
> incompatible with the codepage for the console, giving the weird
> characters in the U+0127+ range).


In the 8-bit character range there are two blocks, C0 (0x00-0x1F) and
C1 (0x80-0x9F), which contain "control characters": non-printable
characters that were used to control printing equipment, for example
"move print head here" (sorry for the lack of depth here :)

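Apparently Microsoft decided to wedge more characters into the 8-bit
range by replacing the C1 control characters (0x80-0x9F) with more
useful printable characters such as the euro sign and the dashes,
which seems fair enough.  C0 is left alone.  The other Windows code
pages (windows-1250 and friends) do much the same, but (afaik)
ISO-8859-1 and the rest of the ISO-8859 family reserve that range for
control characters.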
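
To make the difference concrete, here is a small Python sketch (Python
is my choice for illustration, it isn't part of the original problem)
that decodes a few bytes from the 0x80-0x9F range both ways.  The
helper name decode_both is made up for this example.

```python
import unicodedata

def decode_both(byte_value):
    """Decode one byte as ISO-8859-1 and as windows-1252 (hypothetical helper)."""
    raw = bytes([byte_value])
    return raw.decode("iso-8859-1"), raw.decode("cp1252")

# A few bytes from the C1 range that CP1252 assigns to printable characters.
for byte_value in (0x80, 0x85, 0x96, 0x97):
    latin1_char, cp1252_char = decode_both(byte_value)
    print(
        f"0x{byte_value:02X}: "
        f"ISO-8859-1 -> U+{ord(latin1_char):04X} "
        f"(category {unicodedata.category(latin1_char)}), "
        f"CP1252 -> U+{ord(cp1252_char):04X} "
        f"({unicodedata.name(cp1252_char)})"
    )
```

Under ISO-8859-1 each byte comes back as a control character (Unicode
category Cc); under CP1252 the same byte is a printable character such
as the en dash.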

The problem arises when you save a file without being explicit about
the encoding, and then read it back in a different encoding.  This
happens a lot (in Windows) when you save an XML file with a
non-XML-aware editor (say Notepad), and then open it in an XML-aware
editor.  The file will be saved in CP1252, with characters like "en
dash" and "em dash" written as the single bytes 150 and 151 (0x96 and
0x97) rather than as an encoding of characters #8211 and #8212
(U+2013 and U+2014).  So when you open the file using an XML-aware
editor, it reads the XML prolog and decodes the file as, say, UTF-8 or
ISO-8859-1.  In UTF-8 those bytes are not valid sequences at all, and
in ISO-8859-1 they are the non-printable C1 control characters, so
instead of the dashes you get something which can be represented as
either a box or a question mark depending on (...I'm not sure what
that depends on actually).
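
The round trip can be reproduced in a few lines of Python.  This is
only a sketch of the failure mode, with a made-up sample string, not a
description of what any particular editor does internally.

```python
# Sample text containing an en dash (U+2013); the string is an assumption.
text = "a \u2013 b"

# What a CP1252-defaulting editor writes to disk: the dash is byte 0x96.
saved = text.encode("cp1252")
print(saved)

# Read back as ISO-8859-1: decoding succeeds, but the dash has become
# the C1 control character U+0096.
as_latin1 = saved.decode("iso-8859-1")
print(repr(as_latin1))

# Read back as UTF-8: a lone 0x96 is not a valid sequence, so strict
# decoding raises an error.
try:
    saved.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("UTF-8 decode failed:", exc.reason)

# A lenient decoder substitutes U+FFFD, the replacement character.
lenient = saved.decode("utf-8", errors="replace")
print(repr(lenient))
```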

To compound the issue, if your XML is specified as ISO-8859-1 in the
prolog, some MS tools will see bytes in the C1 control range and
auto-switch the encoding to CP1252, giving the impression that
everything is fine.
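
I don't know how those tools actually decide, but the kind of guess
involved might look something like this Python sketch.  The function
name likely_encoding and the rule itself are assumptions for
illustration only.

```python
def likely_encoding(raw, declared="iso-8859-1"):
    """Guess CP1252 when data declared as ISO-8859-1 uses bytes 0x80-0x9F.

    Hypothetical heuristic: real tools may use different rules.
    """
    has_c1_bytes = any(0x80 <= b <= 0x9F for b in raw)
    if declared.lower() == "iso-8859-1" and has_c1_bytes:
        return "cp1252"
    return declared

# Byte 0x96 sits in the C1 range, so the guess switches to CP1252.
print(likely_encoding(b"a \x96 b"))

# Byte 0xE9 (e acute) is outside that range, so the declaration stands.
print(likely_encoding(b"caf\xe9"))
```

The catch is that the file is still mislabelled: a stricter parser that
trusts the prolog will read the same bytes as control characters.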

The simple rule is: always read and write using the same encoding, and
be aware when something is converting between characters and bytes
behind the scenes - servlets for example.  Make sure the font you're
viewing the result in contains the glyphs for the characters you're
trying to view (unhelpfully, the no-glyph character is often the same
box or question mark used to mean no-mapping in the encoding...
requiring a hex editor to check the underlying bytes), and be certain
the viewer is showing the result in the right encoding (the cmd window
here, or say the Eclipse output window, is another notorious spot).
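
As a minimal sketch of "same encoding both ways", assuming Python and
a throwaway temp file: name the encoding explicitly on both the write
and the read, and look at the raw bytes when in doubt.  The helper
write_then_read is made up for this example.

```python
import os
import tempfile

def write_then_read(text, encoding):
    """Write text with an explicit encoding, then return (raw bytes, text read back)."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        # newline="" stops Python translating line endings on any platform.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        with open(path, "rb") as f:
            raw = f.read()
        with open(path, "r", encoding=encoding, newline="") as f:
            return raw, f.read()
    finally:
        os.remove(path)

raw, read_back = write_then_read("\u2014", "utf-8")

# Poor man's hex editor: show the underlying bytes of the em dash.
print(" ".join(f"{b:02x}" for b in raw))
print(read_back == "\u2014")
```

Checking the bytes this way tells you whether the file is wrong or the
viewer is, which the box or question mark on screen can't.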


cheers
andrew
