Re: [xsl] lookaheads in XSLT2 regexes

Subject: Re: [xsl] lookaheads in XSLT2 regexes
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Thu, 04 Mar 2010 22:30:05 +0100

On 04.03.2010 18:39, Michael Kay wrote:

I feel that \b is very much tied to a specific set of characters which might
not be exactly the set you want. I'd be more comfortable providing
general-purpose zero-width look-ahead and look-behind:

If no canonical definition of \w seems feasible, and definitions that depend on the locale or on a user's configuration file yield unexpected results for other users, one could resort to a \w that is defined on a per-stylesheet basis. As I suggested in an earlier posting, a stylesheet attribute with a (limited) regex syntax would do, e.g.: <xsl:stylesheet ... word-constituents="[\p{Ll}\p{Lu}‑]">

When compiling the stylesheet, a preprocessor would statically expand \w, \W, and \b. Of course the word constituents must be thoroughly checked against the limited syntax prior to expansion, in order to ensure that otherwise valid regexes remain valid.
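
To make the static expansion concrete, here is a minimal sketch in Java of what such a preprocessor might do. Everything in it is hypothetical: the class name WordEscapeExpander, the method expand, and in particular the choice to expand \b into lookaround assertions, which XPath 2.0 regexes do not have, so the output targets java.util.regex only. It assumes the word-constituents value is a single, non-negated bracketed character class, and it does not handle \w or \b occurring inside another character class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: statically expands \w, \W and \b against a per-stylesheet word class. */
final class WordEscapeExpander {

    /**
     * @param regex     the regex as written by the stylesheet author (braces not doubled)
     * @param wordClass the word-constituents value, e.g. "[\\p{Ll}\\p{Lu}-]"
     */
    static String expand(String regex, String wordClass) {
        // Minimal stand-in for the "limited syntax" check: one plain, non-negated class.
        if (wordClass.length() < 3 || !wordClass.startsWith("[")
                || !wordClass.endsWith("]") || wordClass.startsWith("[^")) {
            throw new IllegalArgumentException("unsupported word class: " + wordClass);
        }
        String nonWord = "[^" + wordClass.substring(1);
        // A boundary has a word character on exactly one side; string edges count as non-word.
        String boundary = "(?:(?<=" + wordClass + ")(?!" + wordClass + ")"
                        + "|(?<!" + wordClass + ")(?=" + wordClass + "))";

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c != '\\' || i + 1 == regex.length()) {
                out.append(c);
                continue;
            }
            char next = regex.charAt(++i); // consume the escaped character as well
            switch (next) {
                case 'w': out.append(wordClass); break;
                case 'W': out.append(nonWord); break;
                case 'b': out.append(boundary); break;
                default:  out.append('\\').append(next); // \\, \p{..}, \d etc. pass through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String expanded = expand("\\b\\p{Lu}{2,}\\b", "[\\p{Ll}\\p{Lu}-]");
        Matcher m = Pattern.compile(expanded).matcher("NASA and ESA, not Nasa");
        while (m.find()) {
            System.out.println(m.group()); // NASA, then ESA
        }
    }
}
```

Because the scanner consumes escapes in pairs, an escaped backslash followed by a literal w is left alone. That is one of the cases the validity check mentioned above would have to get right.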

regex="(?<=\P{{L}})\p{{Lu}}{{2,}}(?=\P{{L}})"

I tried this in Perl: the lookbehind didn't match at ^ (beginning of line/string), while the lookahead matched at $. Maybe this is different in Java. But if this aspect of lookbehind behaviour turns out to be implementation-dependent, the predictability constraint is violated. In addition, as Liam pointed out, the '<' character in the regex attribute might confuse the XML parser. And I think that for commonplace situations such as word boundaries (whatever definition of 'word' you choose), a crisp single-character escape such as \b should be available, in addition to the powerful and flexible lookahead and lookbehind assertions.
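
For the Java question, a small check against plain java.util.regex may help. This is a sketch that says nothing about how any XSLT processor would implement lookaround; the class and method names are made up for illustration, and the attribute-value braces are un-doubled. A positive lookaround such as (?<=\P{L}) needs an actual non-letter character next to the match, so it cannot succeed at either edge of the string, whereas the negative forms (?<!\p{L}) and (?!\p{L}) only require that no letter is adjacent and therefore also hold at the edges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundEdges {
    // Positive lookarounds, as in the regex quoted above.
    static final Pattern POSITIVE = Pattern.compile("(?<=\\P{L})\\p{Lu}{2,}(?=\\P{L})");
    // Negative lookarounds: "no letter on either side", which is also true at the string edges.
    static final Pattern NEGATIVE = Pattern.compile("(?<!\\p{L})\\p{Lu}{2,}(?!\\p{L})");

    /** Collects every match of p in input, in order. */
    static List<String> matches(Pattern p, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "NASA and ESA";
        System.out.println(matches(POSITIVE, text)); // prints []
        System.out.println(matches(NEGATIVE, text)); // prints [NASA, ESA]
    }
}
```

With delimiters present on both sides, as in "(NASA) and ESA.", both patterns find the same two acronyms; they differ only at the edges of the string.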

This reminds me of the classic mod_rewrite motto:

``The great thing about mod_rewrite is it gives you all the configurability and flexibility of Sendmail. The downside to mod_rewrite is that it gives you all the configurability and flexibility of Sendmail.''

Or, to cite another piece of CS folklore: "Make the easy things easy and the hard things possible."

Of course, if you doubt that the concept of a word boundary or a word constituent is an easy one (in the sense of commonplace), users will have to resort to the flexible lookahead mechanisms (once they are available in XSLT 2.1).

A compromise would be (as suggested above):
- allow the concise \b and \w syntax in regexes,
- provide a per-stylesheet means to redefine the default word-constituent expression.

Gerrit

which seems far more powerful.

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay

-----Original Message-----
From: Imsieke, Gerrit, le-tex [mailto:gerrit.imsieke@xxxxxxxxx]
Sent: 04 March 2010 17:12
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] lookaheads in XSLT2 regexes

Dear Liam,

Thanks for promoting the \b case. As an illustration of \b's
usefulness, let me show how I tag acronyms for a recent project:

    <xsl:template match="text()" mode="majuscules">
      <xsl:analyze-string select="."
          regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
        <xsl:matching-substring>
          <xsl:value-of select="regex-group(1)"/>
          <span class="majusc">
            <xsl:value-of select="regex-group(2)"/>
          </span>
          <xsl:value-of select="regex-group(3)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>

With (a reasonably defined) \b, this could be simplified to

    <xsl:template match="text()" mode="majuscules">
      <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
        <xsl:matching-substring>
          <span class="majusc">
            <xsl:value-of select="."/>
          </span>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>

Please note that \b should not only match the \w/\W boundary,
but also the beginning or end of the string (or line, when
the 'm' flag is in force). Speaking of the 'm' flag, and in
Michael's direction: I regard \b as much more useful than the
'm' flag when processing XML.
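
One practical difference between the two templates can be sketched with java.util.regex standing in for a regex engine that has \b. That is an assumption: XPath 2.0 has no \b, and Java's \b follows Java's own notion of a word character. The class and method names are illustrative only, and the attribute-value braces are un-doubled. The first regex consumes its delimiters, so a delimiter that closes one acronym is no longer available to open the next; a zero-width \b consumes nothing and also holds at the start and end of the string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AcronymBoundaries {
    // Delimiters are captured in groups 1 and 3, the acronym in group 2.
    static final Pattern CONSUMING = Pattern.compile(
            "(^|[\\p{P}\\p{Z}\\p{C}])(\\p{Lu}{2,})([\\p{P}\\p{Z}\\p{C}]|$)");
    // Zero-width boundaries; the whole match (group 0) is the acronym.
    static final Pattern ZERO_WIDTH = Pattern.compile("\\b\\p{Lu}{2,}\\b");

    /** Returns the given capture group of every match, in order. */
    static List<String> acronyms(Pattern p, int group, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            hits.add(m.group(group));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "NASA ESA";
        System.out.println(acronyms(CONSUMING, 2, text));  // prints [NASA]
        System.out.println(acronyms(ZERO_WIDTH, 0, text)); // prints [NASA, ESA]
    }
}
```

In the first case the single space is consumed as the trailing delimiter of "NASA", so "ESA" has neither a start-of-string nor a delimiter in front of it when matching resumes. With two delimiters between the words, as in "NASA, ESA", both patterns find both acronyms.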

Gerrit

On 04.03.2010 06:59, Liam R E Quin wrote:

On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
On the subject of \b I'll note we do have \W and \w
So we do, I overlooked that. And we define it a little differently
from Perl:
[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
So for example "+" is regarded as part of a word, while "-" isn't.
Which strikes me as totally useless, to be honest.
I agree.

We could fix that for XPath 2.1 I think. I'm not sure what the most
useful fix would be, I admit.

The Perl definition of "alphanumeric" plus "_" would probably work
for \w, if one took alphanumeric to mean Letters|Numbers,
\p{L}|\p{N}, and is coincidentally closer to what you get in Perl
if you do
      use locale;
and your locale is (say) en_UK.UTF8, as it's then the same as the
POSIX fragment [[:alpha:][:digit:]_]
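
The two definitions of \w discussed here can be compared character by character. The sketch below writes both as explicit character classes for java.util.regex, used only as a convenient way to evaluate Unicode categories and not as a statement about any XPath implementation. The class and constant names are illustrative.

```java
import java.util.regex.Pattern;

final class WordCharDefinitions {
    // The current definition quoted above: any character that is not P, Z or C.
    static final Pattern CURRENT_WORD = Pattern.compile("[^\\p{P}\\p{Z}\\p{C}]");
    // The suggested alternative: letters, numbers and the underscore.
    static final Pattern PROPOSED_WORD = Pattern.compile("[\\p{L}\\p{N}_]");

    /** True if the single-character string ch is a word character under the given definition. */
    static boolean isWordChar(Pattern definition, String ch) {
        return definition.matcher(ch).matches();
    }

    public static void main(String[] args) {
        for (String ch : new String[] {"+", "-", "_", "7", "\u00e9"}) {
            System.out.println(ch + "  current=" + isWordChar(CURRENT_WORD, ch)
                    + "  proposed=" + isWordChar(PROPOSED_WORD, ch));
        }
    }
}
```

Under the current definition "+" (category Sm) counts as a word character, while "-" (Pd) and "_" (Pc) do not, because both are punctuation. Under the proposed definition "+" and "-" are both excluded and "_" is included.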

There are lots of things that could be added to regular
expressions; but \b is hard to emulate, useful, and also we seem to
have a rather odd \w. If \w is there, I think \b was omitted by
mistake. Or that \w was included by mistake!

Liam


--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler


