RE: [xsl] extracting data in CDATA block of a XML document


Subject: RE: [xsl] extracting data in CDATA block of a XML document
From: "Curtis Burisch" <curtis@xxxxxxxxxx>
Date: Sat, 24 Aug 2002 04:54:23 +0100

Good answer Mike, but probably not that useful. I had the same answer some
months ago from someone else...

Sometimes there is valid XML in the CDATA section. That was the case in my
situation. The solution in that case was to write an extension function
(we're using Xalan-C++) to extract the contents of the specified node within
the CDATA section passed as a parameter. Annoying, roundabout, kludgy, but
serviceable.
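
For anyone who wants the shape of that idea without the Xalan-C++ plumbing,
here is a rough sketch in Python. It is not the extension function described
above; the function name, the sample document and the element names are all
made up for illustration. It assumes the CDATA text is a well-formed XML
document with a single root element, and it gives up (returns None) when that
is not the case.

```python
import xml.etree.ElementTree as ET


def extract_node_text(cdata_text, node_name):
    """Re-parse a text node's string as XML and return the text of the
    first element named node_name (hypothetical helper, not a real API)."""
    try:
        root = ET.fromstring(cdata_text)
    except ET.ParseError:
        # The character data was not well-formed XML after all.
        return None
    # iter() walks the root itself and all descendants in document order.
    match = next(root.iter(node_name), None)
    return None if match is None else match.text


# Made-up sample: an outer document carrying XML inside a CDATA section.
doc = "<msg><payload><![CDATA[<order><id>42</id><item>widget</item></order>]]></payload></msg>"

# After the first parse, the payload is just a string of characters.
payload = ET.fromstring(doc).find("payload").text

print(extract_node_text(payload, "id"))
print(extract_node_text(payload, "item"))
```

The point of the second parse is that a real XML parser does the tearing
apart, rather than string functions in the stylesheet.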

c


-----Original Message-----
From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx
[mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx]On Behalf Of Mike Brown
Sent: 23 August 2002 17:10
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] extracting data in CDATA block of a XML document


Srinivas Ch wrote:
> Now I need to extract all the elements between the
> <![CDATA[ and ]]> and write it into a new xml file.

This is a FAQ, but we all like to give long-winded answers rather than point
you to www.dpawson.co.uk.

The other answers to your question so far have been trying to tell you:

1. What you want is not possible with XSLT, at least not in a way that is
reliable. We aren't going to tell you the unreliable way because you need to
approach this problem differently if you don't want to get burned.

2. It was a poor design decision to embed structured markup in the character
data content of an XML element. Character data is by definition NOT MARKUP.

3. CDATA sections are a convenience for document authors and are relevant for
input only. They just keep you from having to escape "<" and "&" in character
data. A CDATA section means "this looks like markup but it isn't really". The
idea is that

<foo><![CDATA[<bar/>]]></foo>

and

<foo>&lt;bar/></foo>

mean exactly the same thing: An element named 'foo' containing the 6
characters '<bar/>'; NOT an element named 'foo' containing an empty element
named 'bar'. If you wanted the latter, you'd have written <foo><bar/></foo>.
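
One way to check this for yourself is to feed both forms to any conforming
parser and compare what comes out. The snippet below is only a sketch using
Python's standard xml.etree.ElementTree module (my choice for illustration,
nothing to do with XSLT itself); the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

# The same element written two ways: with a CDATA section and with an escape.
cdata_form = ET.fromstring("<foo><![CDATA[<bar/>]]></foo>")
escaped_form = ET.fromstring("<foo>&lt;bar/></foo>")

# Both give a 'foo' element whose content is a plain string of characters.
print(repr(cdata_form.text), repr(escaped_form.text))

# Neither has any child elements; there is no 'bar' element in either tree.
print(len(cdata_form), len(escaped_form))
```

The parser reports no trace of which spelling was used, which is why a
stylesheet cannot tell the two apart either.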

In XPath/XSLT you deal with a node tree that is set up quite similarly:

element 'foo' in no namespace
  |
  |__text '<bar/>'

The text node is going to be what you see there, regardless of whether you
used a CDATA section in the original document.

Since you want XML output, your question is how you produce a result
tree that looks like this:

  element 'bar' in no namespace

And the answer is, that's pretty darn difficult because you would have to
mimic the duties of an XML parser, tearing apart the string in the text node
in order to build the right nodes in the result tree.

The workaround that some idiot is going to suggest with a "hey it works for
me!", without realizing how unportable it is, is going to involve leaving the
text node unchanged but flagging it as an exceptional case for unmodified
serialization, so that it will be emitted as a string of what could very well
be total garbage in the middle of proper, well-formed XML. And that's assuming
you're serializing the result tree, which isn't always a good assumption (in a
browser-based processor you're likely to be passing it as a DOM).

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list



