
Re: [xsl] unreadable characters from indesign


Subject: Re: [xsl] unreadable characters from indesign
From: Marc Lambrichs <marc.lambrichs@xxxxxxxxxxxxx>
Date: Thu, 18 Jan 2007 02:24:16 +0100

Abel Braaksma wrote:

Marc Lambrichs wrote:

I'm reading in an xml-feed from Adobe InDesign and in some nodes there are three characters that can't be interpreted by my xsl-translation using utf-8. The codepoints of these 3 are (octal) 226, 128, 169. First of all, I would like to know what these characters should represent. And secondly, could I filter these characters out using something like translate?


This is not possible. If 226, 128 and 169 are supposed to be octal, you mistyped at least the digits '8' and '9', which do not exist in octal.


Assuming you meant decimal, and you are indeed talking about codepoints, then there cannot be any problem in reading them. The codepoints 226, 128 and 169 represent the string &#226;&#128;&#169; (written here as character references, because I am not sure whether the mailer mangles the raw characters), which are:

U+00E2, LATIN SMALL LETTER A WITH CIRCUMFLEX
U+0080, control
U+00A9, COPYRIGHT SIGN

See http://www.unicode.org/Public/UNIDATA/UnicodeData.txt for a full list of codepoints.

In UTF-8, this is encoded as the following octets (view your input hexadecimal and you can see if this is indeed correct):
U+00E2 >>> C3A2
U+0080 >>> C280
U+00A9 >>> C2A9
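
The mapping above can be checked mechanically. The following is a minimal Python sketch (Python is chosen here for illustration; it is not used anywhere in this thread) that treats 226, 128 and 169 as decimal codepoints and prints their UTF-8 octets. It also tries a second reading, which is an assumption and not something confirmed in the thread: that the three numbers are raw byte values forming one UTF-8 sequence.

```python
import unicodedata

values = [226, 128, 169]

def utf8_hex(codepoint):
    """Return the UTF-8 encoding of one codepoint as uppercase hex."""
    return chr(codepoint).encode("utf-8").hex().upper()

# Reading 1: the numbers are decimal codepoints.
as_codepoints = {f"U+{v:04X}": utf8_hex(v) for v in values}
for name, octets in as_codepoints.items():
    print(name, ">>>", octets)

# Reading 2 (assumption, not stated in the thread): the numbers are
# the raw octets of a single UTF-8 encoded character.
as_bytes = bytes(values)
decoded = as_bytes.decode("utf-8")
print(as_bytes.hex().upper(), ">>>",
      f"U+{ord(decoded):04X}", unicodedata.name(decoded))
```

Under the second reading the octets E2 80 A9 decode to a single character, U+2029 PARAGRAPH SEPARATOR. Looking at the raw bytes of the feed in a hex viewer should show which of the two cases actually applies.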


I am not sure what you mean by "can't be interpreted by my xsl-translation using utf-8", because any conforming XSLT processor understands at least UTF-8 and UTF-16. However, if what you mean is that these characters are present and should be removed, you can indeed use translate() to remove them:

translate($yourinput, '&#226;&#128;&#169;', '')
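
For comparison, the same character-by-character removal can be sketched outside XSLT, for instance as a clean-up step before the feed reaches the stylesheet. This is only an illustration in Python; the names `UNWANTED`, `DELETE_TABLE` and `strip_unwanted` are made up for this example.

```python
# Codepoints to delete, mirroring the second argument of translate().
UNWANTED = "\u00e2\u0080\u00a9"

# With a third argument, str.maketrans maps each listed character to None,
# which deletes it, much like translate() with an empty replacement string.
DELETE_TABLE = str.maketrans("", "", UNWANTED)

def strip_unwanted(text):
    """Return text with every occurrence of the unwanted characters removed."""
    return text.translate(DELETE_TABLE)

sample = "first line\u00e2\u0080\u00a9second line"
cleaned = strip_unwanted(sample)
print(repr(cleaned))
```

As with translate(), the removal works per character and not per sequence, so every legitimate â or © elsewhere in the text is deleted as well.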

But if what you mean is that the input somehow has these three values encoded in a way that is not UTF-8, then you will have to change your input, because it is not possible to process non-UTF-8 data (meaning: data containing illegal UTF-8 sequences) as if it were UTF-8.
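
Whether the input really contains illegal UTF-8 sequences can be tested before it is handed to the XSLT processor. A minimal sketch in Python, assuming the feed is available as raw bytes; the function name `first_invalid_utf8` and the sample data are hypothetical:

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        return err.start
    return None

# The three codepoints correctly encoded as UTF-8: C3 A2 C2 80 C2 A9.
good = "\u00e2\u0080\u00a9".encode("utf-8")

# A lone A9 byte, as a Latin-1 encoded copyright sign would appear.
bad = b"<p>\xa9</p>"

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

A result of None means the data decodes cleanly as UTF-8. An offset points at the first byte that has to be fixed at the source, or that indicates the data should be read with a different declared encoding.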

Cheers,
-- Abel Braaksma
  http://www.nuntia.nl

Sorry, no mistype, sheer stupidity on my part. Rereading the message, I'm sure I should have asked the first half of the question in some Adobe newsgroup, because I still don't understand how those characters end up in my XML and what they should represent. The second half at least shows how to get rid of them.

Cheers,
Marc

