Re: [xsl] Re: Turning escaped mixed content back to XML


Subject: Re: [xsl] Re: Turning escaped mixed content back to XML
From: "Abel Braaksma (Exselt)" <abel@xxxxxxxxxx>
Date: Tue, 01 Apr 2014 03:36:03 +0200

On 28-3-2014 22:49, Martin Holmes wrote:
> On 14-03-28 02:18 PM, David Carlisle wrote:
>> On 28/03/2014 21:06, Martin Holmes wrote:
>>> I spoke too soon. Passing this:
>>>
>>> contains a single TEI-conformant document, comprising a TEI header
>>> and a
>>> text, either in isolation or as part of a
>>> &lt;gi&gt;teiCorpus&lt;/gi&gt;
>>> element.
>>>
>>> into parse-xml-fragment() gets this fatal error:
>>>
>>> FODC0006: First argument to parse-xml-fragment() is not a well-formed
>>> and namespace-well-formed XML fragment. XML parser reported: I/O error
>>> reported by XML parser processing
>>> file:/home/mholmes/Documents/tei/council/translation/new_translations_into_specs.xsl:
>>>
>>>
>>> 404 Not Found for:
>>> http://www.saxonica.com/parse-xml-fragment/actual.xml
>>>
>>
>
> I've tried that, but it seems to make no difference. But my reading of
> the spec suggests that it will accept a mixed-content fragment without
> a root element, though I may be misunderstanding it.
>

Your assumption on fn:parse-xml-fragment() is correct.

I tried your text fragment with fn:parse-xml-fragment() on both Saxon and
Exselt and it simply works. The 404 Not Found error suggests that
something is off elsewhere in your stylesheet. A more complete
input/output/stylesheet example might help track this one down.
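
For reference, a minimal sketch of such a test. The variable name
$escaped and the wrapper element "result" are made up for illustration,
and the stylesheet needs an arbitrary source document to trigger the
match="/" template:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

    <!-- hypothetical variable; once the stylesheet itself is parsed, the
         string holds literal <gi>...</gi> tags and no root element -->
    <xsl:variable name="escaped"
        select="'text, either in isolation or as part of a &lt;gi&gt;teiCorpus&lt;/gi&gt; element.'" />

    <xsl:template match="/">
        <result>
            <!-- parse-xml-fragment() returns a document node; its children
                 here are a text node, a gi element and another text node -->
            <xsl:copy-of select="parse-xml-fragment($escaped)/node()" />
        </result>
    </xsl:template>

</xsl:stylesheet>
```

On a conforming processor this should serialize a result element in
which the escaped tags have become a real gi element.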

If the input XML is crappy, you can use a home-grown approach to
translating the escaped XML. The following is not foolproof, but it
creates XML or almost-XML, depending on your input, and it will _not_
raise an error if the resulting XML is not fully compliant. However,
this code does not take entities, escaped quotes/apostrophes/ampersands,
CDATA sections, comments etc. into account. It is not that hard to add
them if your source contains them, but be aware that it may quickly end
up as a "regex parser for XML", which many on this list will (correctly)
frown upon.

But then again, if your input cannot be relied upon for
fn:parse-xml-fragment(), and/or you need to find out what it looks like
without all the escapes for fault analysis, this may definitely help.

The entity declarations in the DTD at the beginning are not required,
but I use them for readability. The characters chosen for forcing the
processor to output angle brackets when the result is not XML are from
the Private Use Area of Unicode.

Solution 1
--------------
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
    <!ENTITY less "&#xE001;">
    <!ENTITY great "&#xE002;">
]>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:data="http://exselt.net/data"
    xmlns:text="http://example.com/text"
    exclude-result-prefixes="xsl data text"
    version="3.0">
   
    <xsl:output indent="yes" use-character-maps="angle-brackets" />
   
    <xsl:character-map name="angle-brackets">
        <xsl:output-character character="&less;" string="&lt;"/>
        <xsl:output-character character="&great;" string="&gt;"/>
    </xsl:character-map>
   
    <data:escaped>
        <text:p>indicates the amount by which this zone has been rotated
            clockwise, with respect to the normal orientation of the parent
            &lt;gi&gt;surface&lt;/gi&gt; element as implied by the dimensions given
            in the &lt;gi&gt;msDesc&lt;/gi&gt; element or by the coordinates of the
            &lt;gi&gt;surface&lt;/gi&gt; itself. The orientation is expressed in arc
            degrees.</text:p>
        <text:p>a start-tag, with delimiters &lt; and &gt; is intended</text:p>
        <text:p>contains a single TEI-conformant document, comprising a TEI header and a
            text, either in isolation or as part of a &lt;gi&gt;teiCorpus&lt;/gi&gt;
            element.</text:p>
    </data:escaped>
   
    <xsl:variable name="data" select="doc('')/*/data:escaped" />
   
    <xsl:template match="/">
        <xsl:apply-templates select="$data/text:p" />
    </xsl:template>
   
    <xsl:template match="text:p">
        <xsl:copy copy-namespaces="no">
            <xsl:apply-templates />
        </xsl:copy>
    </xsl:template>
   
    <xsl:template match="text()">
        <!-- find an opening '<' not followed by a space, until the first
             closing '>'; the name may be a single character, e.g. <p> -->
        <xsl:analyze-string select="." regex="&lt;([^ &gt;][^&gt;]*)&gt;">
            <xsl:matching-substring>
                <xsl:value-of select="'&less;' || regex-group(1) || '&great;'" />
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="." />
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
   
</xsl:stylesheet>

Solution 2
--------------
The following uses fn:parse-xml-fragment and the new xsl:try/xsl:catch
to fix the fragment if an error occurs. Again, this is not foolproof,
but as a fallback, it simply dumps the string as it is when it cannot be
processed.

Note that I also deliberately changed the 3rd text to be invalid, but
with only one error so that it can be fixed by the "fixup" part. Note
also that the recursive nature of this solution is currently very
limited, but once better errors are available in try/catch (with
line-number and column-number), you might use this as a starting point
for an XML cleanup function ;).

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:data="http://example.com/data"
    xmlns:text="http://exselt.net/text"
    xmlns:err="http://www.w3.org/2005/xqt-errors"
    exclude-result-prefixes="xs xsl data text err"
    version="3.0">
   
    <xsl:output indent="yes"/>
   
    <data:escaped>
        <text:p>
            indicates the amount by which this zone has been
            rotated clockwise, with respect to the normal
            orientation of the parent &lt;gi&gt;surface&lt;/gi&gt;
            element as implied by the dimensions given in the
            &lt;gi&gt;msDesc&lt;/gi&gt; element or by the
            coordinates of the &lt;gi&gt;surface&lt;/gi&gt;
            itself. The orientation is expressed in arc degrees.
        </text:p>
        <text:p>
            a start-tag, with delimiters &lt; and &gt; is intended
        </text:p>
        <text:p>
            contains a single &lt;TEI-conformant document,
            comprising a TEI header and a text, either in isolation
            or as part of a &lt;giteiCorpus&lt;/gi&gt; element.
        </text:p>
    </data:escaped>
   
    <xsl:variable name="data" select="doc('')/*/data:escaped" />
   
    <xsl:template match="/">       
        <xsl:apply-templates select="$data/text:p" />
    </xsl:template>
   
    <xsl:template match="text:p">
        <xsl:copy copy-namespaces="no">
            <xsl:apply-templates mode="parse" />
        </xsl:copy>
    </xsl:template>
   
    <xsl:template match="." mode="parse">
        <xsl:param name="recur" as="xs:boolean" select="true()" />
        <xsl:try>
            <xsl:copy-of select="parse-xml-fragment(.)" />
           
            <!-- when parsing fails, this is the error -->
            <xsl:catch errors="err:FODC0006">

                <!-- recursively apply templates until we are fixed
                     currently max one level deep, should use
                     $err:line/col-number once that is available -->
                <xsl:variable name="pos"
                    select="string-length(substring-before(., '&lt;'))" />

                <!-- poor man's error fixing -->
                <xsl:variable name="fixed"
                    select="
                    substring(., 1, $pos) ||
                    substring(., $pos + 1, 1)!replace(., '&lt;', '&amp;lt;') ||
                    substring(., $pos + 2)" />

                <!-- using Dimitre's style ifs for recursion ;) -->
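                <xsl:apply-templates select="$fixed[$recur]" mode="#current">
                    <!-- must match the declared parameter name, otherwise
                         $recur stays true and the recursion never stops -->
                    <xsl:with-param name="recur" select="false()" />
                </xsl:apply-templates>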
                <xsl:copy-of select="$fixed[not($recur)]"  />
            </xsl:catch>
        </xsl:try>       
    </xsl:template>
</xsl:stylesheet>
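
To see why a given fragment is rejected, the catch block can also echo
the error details. The following is only a sketch: the $broken variable
and the "unparsed" element are invented for illustration, while
$err:code and $err:description are the variables XSLT 3.0 defines inside
xsl:catch. How informative the description is depends on the processor.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:err="http://www.w3.org/2005/xqt-errors"
    exclude-result-prefixes="err"
    version="3.0">

    <!-- hypothetical test string: the gi start tag is never closed -->
    <xsl:variable name="broken"
        select="'part of a &lt;gi&gt;teiCorpus element.'" />

    <xsl:template match="/">
        <xsl:try>
            <xsl:copy-of select="parse-xml-fragment($broken)" />
            <xsl:catch errors="err:FODC0006">
                <!-- report the error code and the processor's message,
                     then dump the string unparsed -->
                <unparsed code="{$err:code}" reason="{$err:description}">
                    <xsl:value-of select="$broken" />
                </unparsed>
            </xsl:catch>
        </xsl:try>
    </xsl:template>

</xsl:stylesheet>
```

Like the stylesheets above, this needs an arbitrary source document to
trigger the match="/" template.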

Both stylesheets should work cross-processor. I tried them with Exselt
and Saxon.

Not sure all of this is of any use for your current use case, but it was
a nice exercise to play around with, and it made me find some issues in
either processor related to error handling and applying predicates to
strings (both of which I will report appropriately).

Cheers,

Abel Braaksma
Exselt XSLT 3.0 processor
http://exselt.net

