
Re: [xsl] Generating numeric character references


Subject: Re: [xsl] Generating numeric character references
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Tue, 14 Jan 2003 16:55:08 -0500

Stuart,

The reason your task is proving difficult is that it's really not what it appears to be at first blush. There is a trap here, which you can recognize if you can clearly distinguish between XML-as-serialization format, and the XML document (a tree of nodes as described in the XPath spec) that an XSLT processor operates on.

Numeric character references may appear in XML-as-serialization; in the XPath tree (the "document" built by the parser and handed to the XSLT engine), however, these references never appear as such; rather, each has been converted into the character it represents.
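A small Python sketch can make the distinction concrete. It uses the standard library's xml.etree.ElementTree purely as a stand-in for whatever parser feeds your XSLT processor (an assumption for illustration; MSXML and .NET build their trees through their own parsers, but the XML rules for resolving references are the same).

```python
import xml.etree.ElementTree as ET

# A real character reference: the parser resolves it to the character itself.
resolved = ET.fromstring("<p>&#x41;</p>").text

# A double-escaped reference: the parser only undoes the &amp; escape,
# so the tree holds the literal six-character string "&#x41;".
double_escaped = ET.fromstring("<p>&amp;#x41;</p>").text

# On the way out, the serializer decides how to write each character.
# A soft hyphen (U+00AD) cannot be encoded in US-ASCII, so it is
# written as a numeric character reference again.
soft_hyphen = ET.fromstring("<p>&#173;</p>")
ascii_bytes = ET.tostring(soft_hyphen, encoding="us-ascii")

print(repr(resolved))
print(repr(double_escaped))
print(ascii_bytes)
```

The point to take away is that the reference exists only in the serialized bytes; in the tree there is just a character (or, in the double-escaped case, a run of ordinary characters that happens to look like a reference).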

So, for example, if your data has character reference &#x41;, your XSLT processor sees this as an "A". (It may come out the back as "&#x41;" if your serialization encoding happens not to be able to do a proper "A", but internally it's an "A"). Therefore, what's required with "&amp;#x41;" isn't to turn it into "&#x41;", but rather into "A". (Or, if you get my drift: you need to convert "&amp;#x41;" into "&#x41;" *before* your document is parsed, or an "&#x41;" into an "A" *after* your document is parsed.)

You are currently trying to do the latter; and it can be done -- as you're discovering -- with recursive processing over text nodes, heuristics to recognize target substrings, and a table to map them. But it's not a job that XSLT lends itself to, since XSLT is as ungainly for processing strings as it is slick for processing nodes. Far preferable would be to use Perl or something else with good support for string-handling and regular expressions, to do the former task (munge the &amp; entities before parsing).
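As a rough illustration of the pre-parse route, here is a minimal sketch in Python rather than Perl. The function name restore_char_refs and the pattern are illustrative assumptions, not anything from your toolchain. The sketch treats the file as plain text, so it does not know about CDATA sections or comments, and it does not check whether a restored reference names a character that is legal in XML.

```python
import re
import xml.etree.ElementTree as ET

# Matches a double-escaped numeric character reference, decimal or hex,
# e.g. "&amp;#173;" or "&amp;#xAD;". Group 1 is everything after "&amp;".
DOUBLE_ESCAPED_REF = re.compile(r"&amp;(#(?:[0-9]+|x[0-9A-Fa-f]+);)")

def restore_char_refs(raw_xml: str) -> str:
    """Turn '&amp;#NNN;' back into '&#NNN;' in unparsed XML text."""
    return DOUBLE_ESCAPED_REF.sub(r"&\1", raw_xml)

damaged = "<p>soft&amp;#173;hyphen, if x &lt; y, AT&amp;T</p>"
repaired = restore_char_refs(damaged)

# Only the numeric reference is touched; &lt; and the bare &amp; stay escaped,
# so the repaired text is still well-formed and parses normally.
text_after_parse = ET.fromstring(repaired).text

print(repaired)
print(repr(text_after_parse))
```

Because only the "&amp;#...;" pattern is rewritten, legitimate escapes elsewhere in the data are left alone.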

Yet -- and this is where one has to be *very* cautious -- XSLT does, at least in certain circumstances (i.e. with certain processors in certain operational contexts) give you *some* control over how a document, once processed, is serialized -- and *if your data is clean* this optional feature can be drafted into service to help with your problem. What I'm getting to, of course, is the dreaded disable-output-escaping....

That is, if your data is otherwise unproblematic, you can achieve your goal by running your document through a near-identity transform that disables output escaping on your text nodes. The document will emerge from the transform unchanged (at least as XPath sees it) but with "&amp;#x41;" represented as "&#x41;". This, *when parsed again*, will be seen as the "A" you really want.

Note that this is not (if we're really strict with our terms) a transformation in the XSLT sense. Rather, it's a tricky application of the serializer attached to most processors. It will sometimes break, because it disables escaping on the wrong characters as well (so if you have any data such as "if x &lt; y", you're going to be in trouble unless you write string-processing code to catch and work around it), and it relies on an optional feature of the language, which restricts portability.
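The failure mode can be simulated without an XSLT processor at all. The Python sketch below is only a stand-in: the hypothetical helper serialize_unescaped mimics what a serializer does to a text node when output escaping is disabled (it writes the characters raw), and xml.etree.ElementTree plays the part of the parser on both sides.

```python
import xml.etree.ElementTree as ET

def serialize_unescaped(text: str) -> str:
    """Mimic disable-output-escaping: write a text node's characters raw."""
    return "<p>" + text + "</p>"

# Text node values as the processor sees them after the first parse.
clean = ET.fromstring("<p>soft&amp;#173;hyphen</p>").text
risky = ET.fromstring("<p>if x &lt; y</p>").text

# Clean data: the raw "&#173;" is read as a real reference on the next parse.
reparsed = ET.fromstring(serialize_unescaped(clean)).text

# Risky data: the raw "<" makes the output not well-formed.
broke = False
try:
    ET.fromstring(serialize_unescaped(risky))
except ET.ParseError as err:
    broke = True
    print("not well-formed:", err)

print(repr(clean), "->", repr(reparsed))
```

Any "<" or bare "&" that was correctly escaped in the input goes out unescaped too, which is why this only works when the data is clean.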

Please consider this only a golden-hammer solution (i.e. lacking a better tool to do the job), and keep in mind it's easy to bang your thumb this way (since any anomalies in the input will make your output not well-formed). It is in view of these limitations that this really should be done in a separate pass, if with XSLT at all.

Cheers,
Wendell

At 03:05 PM 1/14/2003, you wrote:
I'd like to transform specific text substrings into numeric character
references during an XSLT transformation. For example, I want to
transform all occurrences that look like "&amp;#173;" within a string
into "&#173;".

Here's the back story. I have source XML that is generated automatically
from HTML by a third-party. The third-party incorrectly handles entity
references, so that "&#173;" in the original HTML becomes
"&amp;#173;" in the XML. I want to restore the damage done. To simplify
things, I am only interested in documents with ISO 8859-1 encoding.

Below is a solution [1] that I am not pleased with. It is a named
template that recursively parses a string, trying to replace references.
This requires an <xsl:when> element for each value of numeric character
reference that might be encountered (see the "additional cases here"
comment). Problems with this include linear search of values, omitted
values, and opportunity for error in mismatched values.

Can anyone suggest a better approach to generating numeric character
references? I would be fine restricting the solution to MSXML or
.NET's System.Xml.Xsl XSLT processors, if that is an issue.

Many thanks!

Cheers,
Stuart



[1] A less than happy solution:

  <xsl:template name="restoreNumCharRefs">
    <xsl:param name="string"/>

    <xsl:choose>
      <xsl:when test="contains($string, '&amp;')">
        <!-- Split at the first ampersand; $reference is the text up to the next ';' -->
        <xsl:variable name="head" select="substring-before($string, '&amp;')"/>
        <xsl:variable name="remainder" select="substring-after($string, '&amp;')"/>
        <xsl:variable name="reference" select="substring-before($remainder, ';')"/>

        <xsl:variable name="entity">
          <xsl:choose>
            <xsl:when test="$reference='#167'">&#167;</xsl:when>
            <xsl:when test="$reference='#173'">&#173;</xsl:when>

            <!-- additional cases here -->

            <xsl:otherwise>&amp;<xsl:value-of select="$reference"/>;</xsl:otherwise>
          </xsl:choose>
        </xsl:variable>

        <!-- Recurse on whatever follows the ';' that closed the reference -->
        <xsl:variable name="tail">
          <xsl:call-template name="restoreNumCharRefs">
            <xsl:with-param name="string" select="substring-after($remainder, ';')"/>
          </xsl:call-template>
        </xsl:variable>

        <xsl:value-of select="concat($head, $entity, $tail)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$string"/>
      </xsl:otherwise>
    </xsl:choose>

  </xsl:template>


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

