Re: [xsl] Turning escaped mixed content back to XML

Subject: Re: [xsl] Turning escaped mixed content back to XML
From: Graydon <graydon@xxxxxxxxx>
Date: Fri, 28 Mar 2014 15:17:53 -0400

On Fri, Mar 28, 2014 at 12:02:11PM -0700, Martin Holmes scripsit:
> On 14-03-28 11:32 AM, Graydon wrote:
> >On Fri, Mar 28, 2014 at 11:12:37AM -0700, Martin Holmes scripsit:
> >[getting escaped text back into parsed content]
> >>     <xsl:template match="text:p" exclude-result-prefixes="#all">
> >>         <xsl:variable name="unparsed">
> >>             <xsl:copy-of select="*|text()"/>
> >>         </xsl:variable>
> >
> >$unparsed is going to be item()* instead of string if it's formed like
> >that, and I don't think saxon:parse will work on item()* as input, it
> >wants a single string.
> That's why I'm trying to use saxon:serialize to feed into saxon:parse.
> But even if I feed the string-joined text nodes directly into
> saxon:parse(), it fails; I get a "Content not allowed in prolog"
> error, presumably because there's no containing root element in the
> unparsed string. If I try to add that:

Yes.  serialize() and parse() want well-formed documents, I think the
phrase is; something that could be a document if it was off by itself,
with a single root element.

parse-fragment-string() doesn't, and it might be a better bet for the
data you've got.
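
A minimal sketch of the fragment approach, assuming a processor with
XPath 3.0 support.  The standard function I'm sure of there is
parse-xml-fragment(); whether your Saxon version has it under that name
is something to check.  The urn:example:text namespace is a placeholder
for whatever text: is really bound to, and string(.) stands in for the
string-join of the text nodes.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:text="urn:example:text"
    exclude-result-prefixes="#all">

  <!-- urn:example:text is a placeholder; use the real namespace of text:p -->
  <xsl:template match="text:p">
    <!-- string(.) concatenates the descendant text nodes of this text:p only -->
    <xsl:variable name="unparsed" select="string(.)"/>
    <p>
      <!-- parse-xml-fragment() accepts content with no single root element
           and returns a document node; copy that node's children -->
      <xsl:copy-of select="parse-xml-fragment($unparsed)/node()"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Because the fragment parser doesn't need a root element, there is no
need to concat a wrapper element onto the string first; the literal p
here is only the output container.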

>     <xsl:template match="text:p" exclude-result-prefixes="#all">
>         <xsl:variable name="unparsed" select="concat('&lt;p&gt;',
> string-join(//text(), ''), '&lt;/p&gt;')"/>
>         <xsl:variable name="parsed" select="saxon:parse($unparsed)"/>
>          <xsl:copy-of select="$parsed" exclude-result-prefixes="#all"/>
>     </xsl:template>
> I get "The entity name must immediately follow the '&' in the entity
> reference," which is a bit puzzling...

Is it possible you've got &amp; entities (or other default XML entities)
in the data?  Those tend to make this whole serialization/parse process
really unpleasant.
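
To make that concrete: if the source has "Smith &amp;amp; Jones"
written as "Smith &amp; Jones", the text node's string value is
"Smith & Jones", and the parser then sees a bare ampersand followed by
a space.  A partial workaround, sketched on the assumption that the
only bare ampersands in the data are ones followed by whitespace
(again, urn:example:text is a placeholder namespace):

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:text="urn:example:text"
    exclude-result-prefixes="#all">

  <xsl:template match="text:p">
    <xsl:variable name="unparsed" select="string(.)"/>
    <!-- Re-escape an ampersand that is followed by whitespace, keeping
         the whitespace character.  Ampersands at the end of the string
         or before punctuation are NOT handled by this pattern. -->
    <xsl:variable name="repaired"
        select="replace($unparsed, '&amp;(\s)', '&amp;amp;$1')"/>
    <!-- $repaired is the string to hand to the parsing function -->
    <xsl:sequence select="$repaired"/>
  </xsl:template>

</xsl:stylesheet>
```

This only patches one symptom; if the data can contain other bare
ampersands, it needs fixing wherever the escaping happened.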

If not, use xsl:message or xsl:sequence to dump the value you're trying
to parse and see what it really looks like.  One of the other problems
with markup escaped as text is that nothing parses it until you try, so
it can lose angle brackets, gain spaces in bad places, and so on, and
there often isn't any good automated way to fix that.
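
One way to do that dump, as a sketch: a throwaway stylesheet using the
text output method, so the string comes out exactly as it is with no
XML escaping added.  (Depending on the processor, xsl:message output
may be serialized as XML and show the angle brackets escaped, which is
why this uses the main output instead.)  The urn:example:text namespace
is a placeholder.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:text="urn:example:text">

  <!-- text output: string values are written as-is, with no escaping -->
  <xsl:output method="text"/>

  <!-- one line per text:p; the square brackets make leading and
       trailing whitespace visible -->
  <xsl:template match="text:p">
    <xsl:text>[</xsl:text>
    <xsl:value-of select="string(.)"/>
    <xsl:text>]&#10;</xsl:text>
  </xsl:template>

  <!-- suppress text outside text:p so only the dumps appear -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```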

-- Graydon
