Re: [xsl] using xsl:message with UTF-8 characters
Subject: Re: [xsl] using xsl:message with UTF-8 characters
From: "Andrew Welch" <andrew.j.welch@xxxxxxxxx>
Date: Mon, 23 Apr 2007 14:21:04 +0100
On 4/23/07, Abel Braaksma <abel.online@xxxxxxxxx> wrote:
When the Regional settings are set to US or some Western European country, the codepage will default to CP1252 (windows-1252) (which is, like I said, incompatible with the codepage for the console, giving the weird characters in the range above 127 (U+007F)).
In the 8-bit character range there are two blocks, C0 (0x00-0x1F) and C1 (0x80-0x9F), which contain "control characters": non-printable characters that were used to control printing equipment, for example "move print head here" (sorry for the lack of depth here :)
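Apparently Microsoft decided to wedge more characters into the 8-bit range by replacing the control characters in the C1 range (0x80-0x9F) with more useful printable characters, which seems fair enough; the C0 range is left alone. ISO-8859-1 keeps the C1 range for control characters, so the two encodings disagree on exactly those bytes. CP1252 is the one you will meet most often, though (afaik) the other Windows code pages reuse the C1 range in the same way.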
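To make the difference concrete, here is a minimal Python sketch, assuming the standard `cp1252` and `latin-1` codecs that ship with Python. The helper name `describe_byte` is my own, not part of any library.

```python
import unicodedata


def describe_byte(value: int, encoding: str) -> str:
    """Decode a single byte and report the code point and its Unicode category."""
    char = bytes([value]).decode(encoding)
    # Category "Cc" is a control character; "Pd" is dash punctuation.
    return f"U+{ord(char):04X} {unicodedata.category(char)}"


# 0x96 and 0x97 sit in the C1 range, 0x09 (tab) sits in the C0 range.
for value in (0x09, 0x96, 0x97):
    print(hex(value),
          "latin-1:", describe_byte(value, "latin-1"),
          "| cp1252:", describe_byte(value, "cp1252"))
```

The C0 byte decodes identically in both encodings, while the two C1 bytes come out as control characters in ISO-8859-1 and as dashes in CP1252.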
The problem arises when you save a file without being explicit about the encoding and then read it back in a different encoding. This happens a lot (in Windows) when you save an XML file with a non-XML-aware editor (say Notepad) and then open it in an XML-aware editor. The file will be saved in CP1252, with characters like "en dash" and "em dash" saved as the single bytes #150 and #151 rather than as an encoding of the code points #8211 and #8212 respectively. So when you open the file in an XML-aware editor, it reads the XML prolog and reads the file in, say, UTF-8 (where a lone byte 150 or 151 is not even a valid sequence) or ISO-8859-1 (where it is a C1 control character), and you get non-printable characters instead of the dashes... which can be shown as either a box or a question mark depending on (...I'm not sure what that depends on actually).
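The mismatch can be reproduced in a few lines of Python. This is only a sketch of the byte-level effect, not of what any particular editor does; the sample string is made up.

```python
text = "pages 10\u201320"  # contains an en dash, U+2013 (decimal 8211)

# What a CP1252 save puts on disk: the dash becomes the single byte 0x96 (decimal 150).
raw = text.encode("cp1252")
print(raw)

# Read back as UTF-8: 0x96 on its own is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict UTF-8 decode fails:", exc.reason)

# A lenient reader substitutes U+FFFD, the replacement character.
lenient = raw.decode("utf-8", errors="replace")

# Read back as ISO-8859-1: never fails, but yields the C1 control U+0096.
as_latin1 = raw.decode("latin-1")
print(repr(lenient), repr(as_latin1))
```

Either way the dash is gone: one reader reports an error or a replacement character, the other silently produces a control character.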
To compound the issue, if your XML is specified as ISO-8859-1 in the prolog, some MS tools will see bytes in the C1 control range and auto-switch the encoding to CP1252, giving the impression everything is fine.
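I don't know how those tools actually decide, so the following Python sketch is only a guess at the kind of heuristic involved. The function `decode_like_lenient_tool` is hypothetical and does not reproduce any real Microsoft API.

```python
def decode_like_lenient_tool(raw: bytes, declared: str) -> str:
    """Hypothetical heuristic: treat declared ISO-8859-1 as CP1252 when C1 bytes appear."""
    is_latin1 = declared.lower() in ("iso-8859-1", "latin-1", "latin1")
    has_c1_bytes = any(0x80 <= b <= 0x9F for b in raw)
    if is_latin1 and has_c1_bytes:
        # errors="replace" covers the few byte values CP1252 leaves undefined.
        return raw.decode("cp1252", errors="replace")
    return raw.decode(declared)


# Byte 0x97 is a control character in ISO-8859-1 but an em dash in CP1252.
print(repr(decode_like_lenient_tool(b"a\x97b", "ISO-8859-1")))
```

A tool behaving like this shows the dashes correctly, which hides the fact that the file does not match its declared encoding. A stricter parser will then treat the same bytes as control characters.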
The simple rule is: always read and write using the same encoding, and be aware when something is converting between characters and bytes behind the scenes (servlets, for example). Make sure the font you're viewing the result in contains the glyphs for the characters you're trying to view (helpfully, the no-glyph character is often the same box or question mark used to mean no-mapping in the encoding... requiring a hex editor to check the underlying bytes), and be certain the viewer is showing the result in the right encoding (the cmd window here, or say the Eclipse output window, which is another notorious spot).
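As a rough illustration of that rule in Python (the file name and sample text are made up for the example): name the encoding on both the write and the read, and look at the raw bytes when the display can't be trusted.

```python
import os
import tempfile

text = "en dash \u2013 and em dash \u2014"

# Throwaway location for the example file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Be explicit on both sides instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
with open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()

# When the display is in doubt, inspect the bytes rather than the glyphs.
with open(path, "rb") as handle:
    raw_hex = handle.read().hex(" ")

print(round_tripped == text)
print(raw_hex)
```

In the hex dump the en dash appears as the three bytes `e2 80 93` and the em dash as `e2 80 94`, which confirms the file really is UTF-8 regardless of what the console or output window renders.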
cheers andrew