Re: [xsl] HTML tags from XML content


Subject: Re: [xsl] HTML tags from XML content
From: Mike Brown <mike@xxxxxxxx>
Date: Sat, 10 Feb 2001 02:20:19 -0700 (MST)

Mingbo Qin wrote:
> Hell,

Well, damn! :) Hello, I think you mean.

> I am trying to transform an XML file to HTML format. The original XML
> elements contain some HTML formatting tags in "&amp;lt;P&amp;gt;" format.

Garbage in, garbage out!

The usual problem is that the XML contains "&lt;P&gt;", but you've got it
doubly escaped, don't you? This indicates a bigger problem. You are using
XML for something it was really not designed for. You (or whoever is
writing this XML) are cramming structured markup into the character data
content of an element, and then trying to extract it and treat it as
something other than the text it literally represents.

First, acknowledge the fact that "&amp;lt;P&amp;gt;" in XML means nothing
more than the sequence of 9 characters & l t ; P & g t ; ...That is, as
far as the XML parser and the XPath/XSLT processor are concerned, this is
not a <P> start tag representing one of the boundaries of a 'P' element;
it's just a text string.

Next, consider that an XSLT processor, if you tell it to emit HTML, is
going to treat such a string in the result tree as just text, and it will
serialize it in such a way that it will not be confused with markup. Thus,
the "&" characters are going to be escaped as &amp; again upon output. The
remaining characters don't need to be escaped. So the output will be
"&amp;lt;P&amp;gt;" in the HTML. Your browser will render this as
"&lt;P&gt;" as you have noted.

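To make the round trip concrete, here is a minimal sketch. The element
names "doc" and "description" are made up for illustration (they are not
taken from the original poster's document), and the results described in
the comments assume a conforming XSLT 1.0 processor using the html output
method.

```xml
<?xml version="1.0"?>
<!-- Assumed input document (hypothetical element names):
       <doc><description>&amp;lt;P&amp;gt;Hello</description></doc>
     After parsing, the string value of description is the plain text
       &lt;P&gt;Hello
     with no markup in it at all. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- Creates an ordinary text node; on output the serializer
             escapes each "&" in it again. -->
        <xsl:value-of select="doc/description"/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
<!-- The serialized HTML body then contains
       &amp;lt;P&amp;gt;Hello
     which a browser displays as the text
       &lt;P&gt;Hello -->
```
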
> I want this to be converted to "<P>". My guess is even if somehow I can make
> this to happen, a "<P>" string will be diplayed on the browser. 

XSLT gives processors the option of supporting the
disable-output-escaping="yes" attribute on the XSLT instruction elements
that result in the creation of text nodes (xsl:value-of and xsl:text). If
the processor supports it, the text node will be emitted with output
escaping disabled, so you could conceivably get "&lt;P&gt;" in the HTML,
which will be rendered as "<P>" in the browser. So your guess is correct.
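
As a sketch of what that looks like in a stylesheet, assuming the same
kind of input as above with the text held in a hypothetical "description"
element: the only change from an ordinary xsl:value-of is the extra
attribute. A processor is allowed to ignore it, so the result described in
the comment is what you get only if the attribute is honored.

```xml
<?xml version="1.0"?>
<!-- Assumed input (hypothetical element names):
       <doc><description>&amp;lt;P&amp;gt;Hello</description></doc> -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="description">
    <!-- Optional feature: asks the serializer to write this text node
         as-is instead of escaping its "&" characters. -->
    <xsl:value-of select="." disable-output-escaping="yes"/>
  </xsl:template>
</xsl:stylesheet>
<!-- If the processor honors the attribute, the output contains
       &lt;P&gt;Hello
     which a browser displays as the text <P>Hello, still not as a
     paragraph element. -->
```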

There are reasons to avoid disable-output-escaping. It is optional, for
one thing, so you can't be assured that your code will be portable. It
can also produce output that does not conform to standards and thus may
not be parseable when it is read back in. In the case of plain old HTML
this is of little concern, since browsers expect to get tag soup anyway,
but in the case of XHTML or any other XML, it's a big deal and something
you want to avoid.

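What you want to do is replace occurrences of the character sequences
& l t ; and & g t ; in a string with just < and >, respectively. Since the
XPath translate() function only maps single characters to single
characters, you will have to do this with a recursive named template. Keep
in mind that the resulting < and > are still just text, so they will be
escaped on output unless output escaping is disabled for them. I have an
example of the replacement technique at
http://skew.org/xml/stylesheets/replace/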
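
The general shape of such a template can be sketched as follows. This is
an illustration written for this note, not necessarily the code published
at that URL; the template name "replace-all", its parameters "text",
"from" and "to", and the "description" element are all hypothetical.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Hypothetical helper: copies $text, replacing every occurrence of
       the substring $from with $to. It recurses on the remainder after
       each match, since XPath 1.0 has no substring-replace function. -->
  <xsl:template name="replace-all">
    <xsl:param name="text"/>
    <xsl:param name="from"/>
    <xsl:param name="to"/>
    <xsl:choose>
      <xsl:when test="$from != '' and contains($text, $from)">
        <xsl:value-of select="substring-before($text, $from)"/>
        <xsl:value-of select="$to"/>
        <xsl:call-template name="replace-all">
          <xsl:with-param name="text"
                          select="substring-after($text, $from)"/>
          <xsl:with-param name="from" select="$from"/>
          <xsl:with-param name="to" select="$to"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="description">
    <!-- First pass: the four characters & l t ; become "<" -->
    <xsl:variable name="step1">
      <xsl:call-template name="replace-all">
        <xsl:with-param name="text" select="string(.)"/>
        <xsl:with-param name="from" select="'&amp;lt;'"/>
        <xsl:with-param name="to" select="'&lt;'"/>
      </xsl:call-template>
    </xsl:variable>
    <!-- Second pass: the four characters & g t ; become ">" -->
    <xsl:call-template name="replace-all">
      <xsl:with-param name="text" select="string($step1)"/>
      <xsl:with-param name="from" select="'&amp;gt;'"/>
      <xsl:with-param name="to" select="'&gt;'"/>
    </xsl:call-template>
  </xsl:template>
</xsl:stylesheet>
```

As written, the result is still a text node whose value is <P>Hello (for
the sample input used earlier), so the serializer escapes the "<" again on
output. Getting a real tag out of it would mean adding
disable-output-escaping="yes" to the xsl:value-of instructions in the
helper, with all the caveats above.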

Good luck, and try to do something about that XML. XML is just not a good
carrier for HTML. Maybe run the HTML through Tidy first to make it XHTML,
so you don't have to worry about this stuff and can just use xsl:copy-of?
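
A sketch of that approach, assuming the cleaned-up markup is stored as
real child elements of a hypothetical "description" element, and that
those elements are in no namespace (in XSLT 1.0 the html output method
only treats no-namespace elements as HTML):

```xml
<?xml version="1.0"?>
<!-- Assumed input (hypothetical element names):
       <doc><description><p>Hello</p></description></doc>
     Here p is a genuine element node, not escaped text. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="description">
    <div>
      <!-- Copies the child elements and text to the result tree
           unchanged; no string replacement or escaping tricks needed. -->
      <xsl:copy-of select="node()"/>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

With this arrangement the p element reaches the output as a real
paragraph element, and nothing depends on optional processor features.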

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at            My XML/XSL resources: 
webb.net in Denver, Colorado, USA              http://skew.org/xml/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


