
[xsl] Processing XML with multiple nested CDATA sections


Subject: [xsl] Processing XML with multiple nested CDATA sections
From: dvint@xxxxxxxxx
Date: Thu, 28 Feb 2013 15:47:43 -0800

I have an XML file that is an export from a wiki site. The management
information for the wiki is in plain XML, but the information contained in
the pages (the actual content) has been wrapped in CDATA sections. Some of
these CDATA sections have CDATA sections inside them. I need to extract the
content and create an individual file for each page.

So my first hurdle is unwrapping all these CDATA sections. I have been
handling the first level with a simple xsl:value-of that disables output
escaping:

<xsl:result-document method="xml" href="{element/id}.html">
    <html xmlns:ac="foo" xmlns:ri="bar">
        <xsl:value-of disable-output-escaping="yes"
            select="normalize-space(key('objects', id)/property[@name='body'])"/>
    </html>
</xsl:result-document>

Is there some trick to deal with the nesting that I might try? So far it
looks like I have about 3 levels to deal with. Content I'm processing
looks like this:
<hibernate-generic datetime="2012-12-30 17:00:12">
<object class="BodyContent" package="com.atlassian.confluence.core">
		<id name="id">37749131</id>
		<property name="body">
			<![CDATA[<p>Creating Inted.</p><p>You can also ptions.</p>
<h1>Generating</h1><p><ac:link><ri:page ri:content-title="Types of
Widgets" /><ac:plain-text-link-body><![CDATA[Infographic widgets]] >
</ac:plain-text-link-body></ac:link> are ways.</p>]]>
		</property>
		<property name="content" class="Page"
package="com.atlassian.confluence.pages">
			<id name="id">37716459</id>
		</property>
		<property name="bodyType">2</property>
	</object>
</hibernate-generic>

This is my typical situation where there are little CDATA sections for the
filenames, but I have seen other situations where large sections of
content have been wrapped this way as well.

I can brute-force this and process my output file several times until all
the CDATA sections are finally cleaned up, but I would like something more
elegant.
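
For illustration, here is a minimal, untested sketch of one less brute-force
route. It assumes an XSLT 3.0 processor that provides parse-xml(), and it
assumes the "]] >" in the sample is how the export escapes the inner CDATA
terminator. The function name f:parse-body, its namespace, the wrapper
element name "body" and the object selection path are all hypothetical. The
namespace URIs "foo" and "bar" are the placeholders from the stylesheet
above.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:local"
    exclude-result-prefixes="xs f">

  <!-- Hypothetical helper: turn the escaped body string back into real nodes. -->
  <xsl:function name="f:parse-body" as="element()">
    <xsl:param name="body" as="xs:string"/>
    <!-- Assumption: nested CDATA terminators were written as "]] >".
         Restore them so the inner sections become ordinary CDATA again. -->
    <xsl:variable name="restored" as="xs:string"
        select="replace($body, '\]\] >', ']]&gt;')"/>
    <!-- Wrap the string in an element that declares the ac: and ri: prefixes,
         otherwise the parser would reject the unbound prefixes. -->
    <xsl:sequence select="parse-xml(concat(
        '&lt;body xmlns:ac=&quot;foo&quot; xmlns:ri=&quot;bar&quot;&gt;',
        $restored,
        '&lt;/body&gt;'))/*"/>
  </xsl:function>

  <xsl:template match="/">
    <!-- Assumption: one output file per BodyContent object, named by its id. -->
    <xsl:for-each select="hibernate-generic/object[@class = 'BodyContent']">
      <xsl:result-document method="xml" href="{id}.html">
        <html>
          <!-- Copy the parsed children; they are nodes now, so they could
               just as well be passed to xsl:apply-templates. -->
          <xsl:copy-of
              select="f:parse-body(string(property[@name = 'body']))/node()"/>
        </html>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because the parser itself resolves the restored inner CDATA sections, no
disable-output-escaping is needed and the result is a real node tree. Two
caveats: a body that is not well-formed once restored (for example one that
uses named HTML entities such as &nbsp;) would make parse-xml() fail, and
deeper nesting levels may need the same restore step repeated on the inner
text.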

I will also need to de-reference these <id> elements within the original
context, so even my current simple approach is going to cause problems. The
current approach extracts the content so that I can look at it easily, but
ultimately I would really like to submit the contents of that first CDATA
section in the <property> element for additional processing. For instance,
those <ac:link> elements need to be converted to a different linking
structure, such as the more typical <a href=""> form.
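
As a rough, untested sketch of that kind of follow-up processing, the
stylesheet below assumes the body has already been turned back into real
nodes by some means and that an XSLT 3.0 processor is available. The target
URL scheme (page title plus ".html") is purely hypothetical, and "foo" and
"bar" are the placeholder namespace URIs from the stylesheet above.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ac="foo"
    xmlns:ri="bar"
    exclude-result-prefixes="ac ri">

  <!-- Copy everything that no more specific template matches. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Convert a page link into an ordinary HTML anchor. -->
  <xsl:template match="ac:link[ri:page]">
    <!-- Hypothetical URL scheme: the escaped page title plus ".html". -->
    <a href="{encode-for-uri(ri:page/@ri:content-title)}.html">
      <!-- Prefer the explicit link text; fall back to the page title. -->
      <xsl:value-of select="normalize-space(
          (ac:plain-text-link-body, ri:page/@ri:content-title)[1])"/>
    </a>
  </xsl:template>

</xsl:stylesheet>
```

Other link targets would each need their own template in the same style.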

..dan

