
Re: [xsl] Recognized Unicode characters?


Subject: Re: [xsl] Recognized Unicode characters?
From: David Carlisle <davidc@xxxxxxxxx>
Date: Mon, 9 May 2005 14:52:12 +0100

  I set output to HTML because that is the output I am creating. (isn't this 
  right?)

Yes. (It's the default anyway if the top-level result element is html in
no namespace, but setting it doesn't do any harm.)

  As I understand it, shouldn't the XSLT processor know from the "encoding" 
  attribute that the references will be to Unicode numbers and read them 
  correctly as those characters.

That's not how it works. In XML a character reference & # 1 2 3 ;
_always_ refers to a Unicode character number, irrespective of the
encoding.
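
A quick way to see this, sketched in Python rather than XSLT (the two
sample documents are made up for illustration): the same reference, to
character number 8212, is parsed from two files that declare different
encodings, and the parser reports the same Unicode character both times.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that differ only in their declared encoding.
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?><p>&#8212;</p>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="utf-8"?><p>&#8212;</p>'
).encode("utf-8")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Both parse to the single character whose Unicode number is 8212,
# whatever encoding the file itself was declared to be in.
print(ord(latin1_text), ord(utf8_text))
```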

The encoding tells the system what characters the actual bytes in the
file mean. For example, if your file has "abc" it doesn't really have
the letters a, b and c; it has bytes with values 97, 98 and 99. In order
to know that 97 is an a, the system needs to know what encoding the file
is in. 97=a is the ASCII encoding of a, and many encodings are
compatible with this, so there is a tendency to think that this is some
universal law. It's not: XML doesn't assume ASCII-compatible encodings,
and in fact it mandates support for one encoding that is not
ASCII-compatible, UTF-16, in which an a is encoded as two bytes, one
with value 0 and one with value 97.
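
The byte values are easy to check for yourself. A small Python sketch
(Python is just a convenient calculator here; "utf-16-be" is its name for
big-endian UTF-16 without a byte order mark, and in the little-endian
form the two bytes come in the other order):

```python
text = "abc"

# The same characters, seen as the byte values actually stored in a file.
ascii_bytes = list(text.encode("ascii"))
utf16_bytes = list("a".encode("utf-16-be"))

print(ascii_bytes)  # [97, 98, 99]
print(utf16_bytes)  # [0, 97]

# Going the other way only works if you know which encoding to apply.
a_from_ascii = bytes([97]).decode("ascii")
a_from_utf16 = bytes([0, 97]).decode("utf-16-be")
print(a_from_ascii, a_from_utf16)  # a a
```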


  So, I am still confused why a Unicode reference to #8212 won't output 
  correctly? The output displays a square box in both the browser (IE6) as well 
  as in the HTML source itself (viewed via Windows notepad).

Your stylesheet processor has some leeway in what encoding it uses; it
can ignore the hints in xsl:output, so the important thing is what it
actually used, not just what you asked for. If you are outputting via
the html method then
a) many processors will output this character as & m d a s h ;
b) whether they do or not, they should declare the encoding that is
   actually used in the file by adding a < meta> element with an http-equiv
   attribute that specifies the encoding used.
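
On point (a), the named entity and the numeric reference are two
spellings of the same character as far as an HTML parser is concerned.
A small sketch using Python's standard html module (chosen only for
illustration; it says nothing about what any particular XSLT processor
emits):

```python
import html
from html.entities import codepoint2name

dash = chr(8212)

# The HTML entity name registered for code point 8212.
entity_name = codepoint2name[8212]

# An HTML parser resolves both spellings to the same character.
from_entity = html.unescape("&mdash;")
from_number = html.unescape("&#8212;")

print(entity_name)
print(from_entity == from_number == dash)
```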

If your output is using UTF-8 and the character is output as a character
in that encoding (rather than a character reference & # ... or an entity
reference & m d a s...) then it will work so long as your browser is set
up to view the page as UTF-8. This may or may not be automatic, depending
on browser settings; see the view/encoding menu option in IE6.
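
To illustrate why the viewer's setting matters, here is a Python sketch
that reads the same UTF-8 bytes two ways. The choice of Windows-1252 as
the "wrong" encoding is only an assumption for the example; I don't know
what your browser or Notepad actually assumed, and the exact garbage you
see will depend on that.

```python
dash = "\u2014"  # character number 8212

# What a UTF-8 output file actually contains for this one character.
utf8_bytes = dash.encode("utf-8")
print(utf8_bytes.hex())  # e28094

# A viewer that applies the right encoding recovers the character.
read_correctly = utf8_bytes.decode("utf-8")

# A viewer that assumes Windows-1252 (hypothetical) sees three unrelated
# characters, one per byte, instead of one dash.
read_wrongly = utf8_bytes.decode("cp1252")

print(read_correctly == dash)
print(len(read_wrongly))
```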


In the XML output method a character that is not in the encoding will be
output using a character reference. UTF-8 encodes all of Unicode, so if
you output in that encoding you would not expect to see character
references in the output. If on the other hand you output to encoding
US-ASCII then only ASCII characters can be output directly, so any
non-ASCII character will be output using a character reference.
The advantage here is that the file itself is then just ASCII encoded, so
it will work in browsers which don't have encoding support correctly set
up. The disadvantage is that if any non-ASCII character is used in a
place where you cannot represent it by a character reference (for
example, if an element name or the content of a comment uses such a
character) then you will get no output and a fatal error that your result
cannot be produced in the specified encoding.
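
The rule can be sketched in a few lines of Python. This is a toy
serializer written for illustration, not how any real XSLT processor is
implemented; the function name serialize and the sample element names are
hypothetical.

```python
def serialize(tag: str, text: str, encoding: str) -> bytes:
    """Toy XML serializer for one element with text content."""
    # Element names cannot contain character references, so the name
    # must be directly representable in the target encoding.
    try:
        tag.encode(encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"element name {tag!r} cannot be written in {encoding}"
        ) from exc

    escaped = text.replace("&", "&amp;").replace("<", "&lt;")
    # Text that does not fit the encoding falls back to &#NNNN; references.
    return f"<{tag}>{escaped}</{tag}>".encode(encoding, errors="xmlcharrefreplace")


ascii_out = serialize("p", "a\u2014b", "us-ascii")
utf8_out = serialize("p", "a\u2014b", "utf-8")
print(ascii_out)  # b'<p>a&#8212;b</p>'
print(utf8_out)   # b'<p>a\xe2\x80\x94b</p>'

try:
    serialize("caf\u00e9", "x", "us-ascii")
    name_error = None
except ValueError as exc:
    name_error = str(exc)
print(name_error)
```

With US-ASCII the dash in the text becomes a reference, with UTF-8 it is
written directly, and the non-ASCII element name fails outright because
there is no reference syntax to fall back on.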

David



