Re: [xsl] lookaheads in XSLT2 regexes


Subject: Re: [xsl] lookaheads in XSLT2 regexes
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Thu, 04 Mar 2010 22:30:05 +0100

On 04.03.2010 18:39, Michael Kay wrote:
I feel that \b is very much tied to a specific set of characters which might
not be exactly the set you want. I'd be more comfortable providing
general-purpose zero-width look-ahead and look-behind:

If no canonical definition of \w seems feasible, and definitions that depend on either the locale or a user's configuration file would yield unexpected results for other users, one could resort to a \w that is defined on a per-stylesheet basis. As I suggested in an earlier posting, this could be a stylesheet attribute with a (limited) regex syntax, e.g.:
<xsl:stylesheet ... word-constituents="[\p{Ll}\p{Lu}&#x2011;]">


When compiling the stylesheet, a preprocessor would statically expand \w, \W, and \b. Of course the word constituents must be thoroughly checked against the limited syntax prior to expansion, in order to ensure that otherwise valid regexes remain valid.
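To make the idea of a static expansion concrete, here is a minimal sketch. It is written in Java, and targets Java's regex flavour, only because XSLT 2.0 regexes have no lookaround to expand into. The class WordClassExpander and its methods are hypothetical names, and the sketch assumes that \w, \W and \b never occur inside a bracket expression of the regex being expanded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordClassExpander {

    // Hypothetical helper (not part of any XSLT processor): rewrites \w, \W and \b
    // in terms of a caller-supplied bracket expression such as "[\\p{Ll}\\p{Lu}\\u2011]".
    // Assumption: these escapes never occur inside a bracket expression in `regex`.
    static String expand(String regex, String wordClass) {
        if (wordClass.length() < 3 || !wordClass.startsWith("[")
                || !wordClass.endsWith("]") || wordClass.startsWith("[^")) {
            throw new IllegalArgumentException("expected a non-negated bracket expression");
        }
        String inner = wordClass.substring(1, wordClass.length() - 1);
        String nonWord = "[^" + inner + "]";
        // A boundary is a position with a word constituent on exactly one side.
        String boundary = "(?:(?<=" + wordClass + ")(?!" + wordClass + ")"
                        + "|(?<!" + wordClass + ")(?=" + wordClass + "))";
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i);
                switch (next) {
                    case 'w': out.append(wordClass); break;
                    case 'W': out.append(nonWord); break;
                    case 'b': out.append(boundary); break;
                    default:  out.append(c).append(next); // leave other escapes alone
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Collects every match of `regex` in `input`.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String expanded = expand("\\b\\p{Lu}{2,}\\b", "[\\p{Ll}\\p{Lu}\\u2011]");
        System.out.println(findAll(expanded, "see NATO and EU, not XMLish"));
    }
}
```

Because the boundary is expressed with negative as well as positive lookarounds, it also holds at the very beginning and end of the input, which is the behaviour one would want from \b.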


regex="(?<=\P{{L}})\p{{Lu}}{{2,}}(?=\P{{L}})"

I tried this in Perl: the lookbehind did not match at ^ (beginning of line/string), while the lookahead did match at $.
Maybe this is different in Java. But if this aspect of lookbehind behaviour turns out to be implementation-dependent, the predictability constraint is violated.
In addition, as Liam pointed out, the '<' character in the regex attribute will trip up the XML parser unless it is escaped as &lt;.
And I think that for commonplace situations such as word boundaries (whatever definition of 'word' you choose), a crisp single-character escape such as \b should be available, in addition to the powerful and flexible lookahead and lookbehind assertions.
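As for how Java treats the edges: the following small sketch (the class name LookaroundEdgeDemo is made up for illustration) suggests that a positive lookaround such as (?<=\P{L}) needs an actual non-letter character to look at, so it cannot succeed at the very start or end of the input, whereas the negative forms (?<!\p{L}) and (?!\p{L}) can. The third call is only a guess at why a lookahead might appear to match at $: a trailing newline in the input would satisfy it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundEdgeDemo {

    // Collects every match of `regex` in `input`.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Positive lookarounds: require a real non-letter character on each side.
        String positive = "(?<=\\P{L})\\p{Lu}{2,}(?=\\P{L})";
        // Negative lookarounds: only require that no letter is adjacent.
        String negative = "(?<!\\p{L})\\p{Lu}{2,}(?!\\p{L})";

        System.out.println(findAll(positive, "NATO and EU"));    // []
        System.out.println(findAll(negative, "NATO and EU"));    // [NATO, EU]
        // With a trailing newline, the positive lookahead finds a non-letter after "EU".
        System.out.println(findAll(positive, "NATO and EU\n"));  // [EU]
    }
}
```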


This reminds me of the classic mod_rewrite motto:

``The great thing about mod_rewrite is it gives you all the configurability and flexibility of Sendmail. The downside to mod_rewrite is that it gives you all the configurability and flexibility of Sendmail.''

Or, to cite another piece of CS folklore: "Make the easy things easy and the hard things possible."

Of course if you doubt that the concept of a word boundary or a word constituent is an easy (in the sense of commonplace) one, the users will have to resort to the flexible lookahead mechanisms (once they are available in XSLT 2.1).

A compromise would be (as suggested above):
- allow the concise \b and \w syntax in regexes,
- provide a per-stylesheet means to redefine the default word-constituent expression.

Gerrit


which seems far more powerful.


Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay

-----Original Message-----
From: Imsieke, Gerrit, le-tex [mailto:gerrit.imsieke@xxxxxxxxx]
Sent: 04 March 2010 17:12
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] lookaheads in XSLT2 regexes

Dear Liam,

Thanks for promoting the \b case. As an illustration for \b's
usefulness, let me show how I tag acronyms for a recent project:

    <xsl:template match="text()" mode="majuscules">
      <xsl:analyze-string select="."
        regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
        <xsl:matching-substring>
          <xsl:value-of select="regex-group(1)"/>
          <span class="majusc">
            <xsl:value-of select="regex-group(2)"/>
          </span>
          <xsl:value-of select="regex-group(3)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>

With (a reasonably defined) \b, this could be simplified to

    <xsl:template match="text()" mode="majuscules">
      <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
        <xsl:matching-substring>
          <span class="majusc">
            <xsl:value-of select="."/>
          </span>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>

Please note that \b should not only match the \w/\W boundary,
but also the beginning or end of the string (or line, when
the 'm' flag is in force). Speaking of the 'm' flag, and in
Michael's direction: I regard \b as much more useful than the
'm' flag when processing XML.
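One further difference between the two formulations can be illustrated outside XSLT. The sketch below uses Java's regex engine, which does have \b, and single-brace syntax because no attribute value template is involved; the class name DelimiterVersusBoundary is invented for the example. Since the first pattern consumes the delimiter after an acronym, a second acronym separated from the first by a single space has no delimiter left in front of it and is missed, whereas the zero-width \b finds both.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimiterVersusBoundary {

    // Delimiters are matched (and consumed) as groups 1 and 3; the acronym is group 2.
    static final String CONSUMING =
        "(^|[\\p{P}\\p{Z}\\p{C}])(\\p{Lu}{2,})([\\p{P}\\p{Z}\\p{C}]|$)";

    // Zero-width boundaries; the whole match (group 0) is the acronym.
    static final String BOUNDARY = "\\b\\p{Lu}{2,}\\b";

    // Collects the given capture group of every match of `regex` in `input`.
    static List<String> findAll(String regex, int group, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group(group));
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "AB CD";
        System.out.println(findAll(CONSUMING, 2, input)); // [AB]
        System.out.println(findAll(BOUNDARY, 0, input));  // [AB, CD]
    }
}
```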

Gerrit



On 04.03.2010 06:59, Liam R E Quin wrote:
On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
On the subject of \b I'll note we do have \W and \w

So we do, I overlooked that. And we define it a little differently from Perl:

[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]

So for example "+" is regarded as part of a word, while "-" isn't.
Which strikes me as totally useless, to be honest.

I agree.


We could fix that for XPath 2.1 I think. I'm not sure what the most
useful fix would be, I admit.

The Perl definition of "alphanumeric" plus "_" would probably work for
\w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N}, and
is coincidentally closer to what you get in Perl if you do
      use locale;
and your locale is (say) en_UK.UTF8, as it's then the same as the
POSIX fragment [[:alpha:][:digit:]_]
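The two candidate definitions of \w discussed here can be compared character by character. This is a rough sketch using Java's Unicode property classes as a stand-in for the XPath ones; the class and constant names are invented for the illustration. It reflects that "+" is a symbol (category Sm), while "-" and "_" are both punctuation (Pd and Pc).

```java
import java.util.regex.Pattern;

final class WordCharDefinitions {

    // \w as quoted in the thread: everything except punctuation,
    // separators and "other" characters.
    static final Pattern CURRENT_W = Pattern.compile("[^\\p{P}\\p{Z}\\p{C}]");

    // The alternative floated in the thread: letters, numbers and underscore.
    static final Pattern PROPOSED_W = Pattern.compile("[\\p{L}\\p{N}_]");

    // True if the single-character string `ch` is a word character under `def`.
    static boolean isWordChar(Pattern def, String ch) {
        return def.matcher(ch).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "+", "-", "_", "7", "\u00e9" };
        for (String s : samples) {
            System.out.println(s + "  current=" + isWordChar(CURRENT_W, s)
                                 + "  proposed=" + isWordChar(PROPOSED_W, s));
        }
    }
}
```

Under the current definition "+" counts as a word character while "-" and "_" do not; under the proposed one only "_" of the three does.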

There are lots of things that could be added to regular expressions;
but \b is hard to emulate, useful, and also we seem to have a rather
odd \w. If \w is there, I think \b was omitted by mistake. Or that
\w was included by mistake!

Liam


-- Gerrit Imsieke Geschäftsführer / Managing Director le-tex publishing services GmbH Weissenfelser Str. 84, 04229 Leipzig, Germany Phone +49 341 355356 110, Fax +49 341 355356 510 gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas
Schmidt, Dr. Reinhard Vöckler

