Re: more encoded questions
Subject: Re: more encoded questions
From: Mike Brown <mike@xxxxxxxx>
Date: Mon, 6 Nov 2000 22:51:36 -0700 (MST)
Josef Vosyka wrote:
> Characters are being rendered according to
> a) input encoding
> b) input form (escaped/non-escaped)

No. The XML document is typically a bit sequence like

  110101010101010111010101111110010101010101010111111...

These bits represent ISO/IEC 10646-1:1993 (UCS) (~Unicode) characters like

  <?xml version="1.0" encoding="utf-8"?>
  <doc>
    <element attribute="cdata">character&#32;data</element>
  </doc>

This mapping of bits to UCS characters is (essentially) the encoding. The encoding declaration in the XML declaration only helps the parser determine the encoding. Once the document is decoded, the declaration is irrelevant: at that point the document is all UCS characters.

After decoding the document, the XML parser resolves character references and certain entity references, turning them into UCS characters too. In the example above, &#32; becomes the space character.

The UCS characters at this level imply the logical structures: elements, attributes, character data. The parser reports these structures to the application (the XSLT processor).

So you see, you can say &#32; or &#x20;, or refer to an entity that you defined as the space character, or put the encoded bits for the character into the binary document ... it doesn't matter. It all means the same thing once it goes through the parser. The XSLT processor only knows about the single space character that was meant, not the 5 characters '&#32;'. Those were just 'physical' markup.

Now consider that the stylesheet is itself an XML document that is parsed just like the source document. The XSLT processor acts on the logical structures. The stylesheet is not a literal specification for output. It is only a representation of how to build the result tree. Character references in the stylesheet are just an abstraction for the individual characters that the processor will actually manipulate.

The stylesheet's instructions result in the creation of a node tree -- the result tree.
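To make the parsing point above concrete, here is a minimal sketch, not a definitive demonstration. It assumes Python's standard xml.etree.ElementTree module as a stand-in for the parser that feeds an XSLT processor. The <doc> element, the content "a b", and the entity name sp are made up for illustration. Each variant spells the space differently at the physical level, and the last one also uses a different encoding, yet the parser should report the same character data every time.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: five physical spellings of the logical content "a b".
variants = {
    # decimal character reference
    "decimal reference": b'<doc>a&#32;b</doc>',
    # hexadecimal character reference
    "hex reference": b'<doc>a&#x20;b</doc>',
    # internal entity (the name "sp" is invented for this sketch)
    "internal entity": b'<!DOCTYPE doc [<!ENTITY sp "&#32;">]><doc>a&sp;b</doc>',
    # the character itself, encoded as UTF-8 bytes
    "literal, utf-8": '<?xml version="1.0" encoding="utf-8"?><doc>a b</doc>'.encode("utf-8"),
    # the character itself, encoded as UTF-16 bytes (with byte order mark)
    "literal, utf-16": '<?xml version="1.0" encoding="utf-16"?><doc>a b</doc>'.encode("utf-16"),
}

# Parse each byte sequence and keep only what the application would see:
# the character data of the <doc> element.
results = {name: ET.fromstring(data).text for name, data in variants.items()}

for name, text in results.items():
    print(f"{name:18} -> {text!r}")
```

Nothing in the parsed result records which spelling or which encoding was used. That information is gone before the application ever sees the document.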
Depending on what you put in the xsl:output element's 'method' and 'encoding' attributes, this tree will be serialized in different ways. The serialization for the xml and html output methods will be as bits in the given encoding. The method might affect whether, say, UCS character 160 (non-breaking space) is output as the encoded bits for the single character number 160, or as the encoded bits for the character sequence '&nbsp;', or as the encoded bits for the character sequence '&#160;' or '&#xA0;'.

I wrote a lot about this at http://www.skew.org/xml/tutorial/ because I was disappointed that XML books make very little effort to address these issues. Concepts like encoding and logical structures should come first. Syntax and code samples come last, and are almost inconsequential once you understand the principles at work. Instead, everyone teaches these things backward, and you end up with situations like this, where your impression of the meaning of a character reference is shaped by the way HTML user agents behave(d).

I think you are under the impression that character references are related to the encoding of the document. They are not. They are by definition, in both HTML and XML, references to characters in one specific repertoire (UCS), whatever the encoding of the document.

- Mike
____________________________________________________________________
Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
My XML/XSL resources: http://www.skew.org/xml/
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list