Re: [xsl] lookaheads in XSLT2 regexes
Subject: Re: [xsl] lookaheads in XSLT2 regexes
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Thu, 04 Mar 2010 22:30:05 +0100
On 04.03.2010 18:39, Michael Kay wrote:
> I feel that \b is very much tied to a specific set of characters which might not be exactly the set you want. I'd be more comfortable providing general-purpose zero-width look-ahead and look-behind:
If no canonical definition of \w seems feasible, and definitions that depend on either the locale or a user's configuration file yield unexpected results for other users, one could resort to a \w that is defined on a per-stylesheet basis. As I suggested in an earlier posting, one could use a stylesheet attribute with a (limited) regex syntax, e.g.:
<xsl:stylesheet ... word-constituents="[\p{Ll}\p{Lu}‑]">
When compiling the stylesheet, a preprocessor would statically expand \w, \W, and \b. Of course the word constituents must be thoroughly checked against the limited syntax prior to expansion, in order to ensure that otherwise valid regexes remain valid.
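The static expansion described here can be sketched in Java. This is only an illustration under stated assumptions: the word-constituents attribute is a proposal from this thread, not an existing XSLT feature, and the class WordEscapeExpander and its expand method are hypothetical names. The sketch accepts only a simple, non-negated bracket expression as the word class and leaves \w, \W and \b untouched inside character classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical preprocessor: expands \w, \W and \b against a user-supplied word class.
class WordEscapeExpander {

    // wordClass is assumed to be a plain, non-negated class such as [\p{Ll}\p{Lu}-].
    static String expand(String regex, String wordClass) {
        if (!wordClass.startsWith("[") || wordClass.startsWith("[^") || !wordClass.endsWith("]")) {
            throw new IllegalArgumentException("expected a non-negated [...] class: " + wordClass);
        }
        String inner = wordClass.substring(1, wordClass.length() - 1);
        String notWord = "[^" + inner + "]";
        // Boundary: word char on exactly one side; the string edges count as non-word.
        String boundary = "(?:(?<=" + wordClass + ")(?!" + wordClass + ")"
                + "|(?<!" + wordClass + ")(?=" + wordClass + "))";

        StringBuilder out = new StringBuilder();
        int classDepth = 0; // > 0 while inside [...]
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i);
                if (classDepth == 0 && next == 'w') {
                    out.append(wordClass);
                } else if (classDepth == 0 && next == 'W') {
                    out.append(notWord);
                } else if (classDepth == 0 && next == 'b') {
                    out.append(boundary);
                } else {
                    out.append(c).append(next); // any other escape is copied unchanged
                }
                continue;
            }
            if (c == '[') {
                classDepth++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String expanded = expand("\\b\\p{Lu}{2,}\\b", "[\\p{Ll}\\p{Lu}-]");
        Matcher m = Pattern.compile(expanded).matcher("XML is an ISO-standard, not XMLish");
        while (m.find()) {
            System.out.println(m.group()); // prints XML only
        }
    }
}
```

Because the hyphen is declared a word constituent in this example, "ISO" in "ISO-standard" has no boundary after it and is not matched, which is the kind of per-stylesheet control the proposal aims at.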
> regex="(?<=\P{L})\p{{Lu}}{{2,}}(?=\P{L})"
I tried this in Perl: the lookbehind didn't match at ^ (beginning of line/string), while the lookahead did match at $.
Maybe this is different with Java. But if this aspect of lookbehind behaviour turns out to be implementation-dependent, the predictability constraint is violated.
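For what it is worth, the Java side can be checked with a small sketch using java.util.regex. The class name LookaroundEdges and the sample sentence are made up for illustration. A positive lookbehind or lookahead on \P{L} needs an actual non-letter character next to the match, so it fails at either end of the input; the negated forms (?<!\p{L}) and (?!\p{L}) also succeed at the edges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: compares positive and negative lookarounds at the string edges.
class LookaroundEdges {

    // The pattern from the quoted message, without the attribute-value-template brace doubling.
    static final Pattern POSITIVE = Pattern.compile("(?<=\\P{L})\\p{Lu}{2,}(?=\\P{L})");

    // Negated form: "not preceded/followed by a letter" is also true at the edges.
    static final Pattern NEGATIVE = Pattern.compile("(?<!\\p{L})\\p{Lu}{2,}(?!\\p{L})");

    static List<String> findAll(Pattern p, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "XML and XSLT use ISO";
        System.out.println(findAll(POSITIVE, text)); // [XSLT]
        System.out.println(findAll(NEGATIVE, text)); // [XML, XSLT, ISO]
    }
}
```

If the input ends in a newline, the positive lookahead does find a non-letter there and the last acronym matches. That is one possible explanation for the asymmetric Perl result, but it is an assumption and not something the thread confirms.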
In addition, as Liam pointed out, the '<' character in the regex attribute is a problem: a literal '<' is not allowed in an XML attribute value, so the lookbehind would have to be written as (?&lt;=...).
And I think that for commonplace situations such as word boundaries (whatever definition of 'word' you might choose), a crisp single-character escape such as \b should be available, in addition to the powerful and flexible lookahead and lookbehind assertions.
This reminds me of the classic mod_rewrite motto:
``The great thing about mod_rewrite is it gives you all the configurability and flexibility of Sendmail. The downside to mod_rewrite is that it gives you all the configurability and flexibility of Sendmail.''
Or, to cite another piece of CS folklore: "Make the easy things easy and the hard things possible."
Of course if you doubt that the concept of a word boundary or a word constituent is an easy (in the sense of commonplace) one, the users will have to resort to the flexible lookahead mechanisms (once they are available in XSLT 2.1).
A compromise would be (as suggested above):
- allow the concise \b and \w syntax in the regexes,
- provide a per-stylesheet means to redefine the default word constituent expression.
Gerrit
> which seems far more powerful.
Regards,
Michael Kay http://www.saxonica.com/ http://twitter.com/michaelhkay
-----Original Message-----
From: Imsieke, Gerrit, le-tex [mailto:gerrit.imsieke@xxxxxxxxx]
Sent: 04 March 2010 17:12
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] lookaheads in XSLT2 regexes
Dear Liam,
Thanks for promoting the \b case. As an illustration for \b's usefulness, let me show how I tag acronyms for a recent project:
<xsl:template match="text()" mode="majuscules">
  <xsl:analyze-string select="."
      regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
    <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)"/>
      <span class="majusc">
        <xsl:value-of select="regex-group(2)"/>
      </span>
      <xsl:value-of select="regex-group(3)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
With (a reasonably defined) \b, this could be simplified to
<xsl:template match="text()" mode="majuscules">
  <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
    <xsl:matching-substring>
      <span class="majusc">
        <xsl:value-of select="."/>
      </span>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
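The difference between the two templates is more than cosmetic, as a Java sketch with java.util.regex suggests. The class AcronymTagger and its method names are hypothetical, and the zero-width variant uses negative lookarounds as a stand-in for a \b based on the same delimiter set; it is not the \b of any particular processor. In the three-group version the trailing delimiter is consumed by the match, so an acronym that follows immediately after a single space no longer finds its leading delimiter.

```java
import java.util.regex.Pattern;

// Illustrative comparison of a delimiter-consuming regex with a zero-width variant.
class AcronymTagger {

    // Same regex as the first template, without the brace doubling needed in XSLT.
    static final Pattern CONSUMING = Pattern.compile(
            "(^|[\\p{P}\\p{Z}\\p{C}])(\\p{Lu}{2,})([\\p{P}\\p{Z}\\p{C}]|$)");

    // Assumed stand-in for \b: no non-delimiter character directly before or after.
    static final Pattern ZERO_WIDTH = Pattern.compile(
            "(?<![^\\p{P}\\p{Z}\\p{C}])(\\p{Lu}{2,})(?![^\\p{P}\\p{Z}\\p{C}])");

    static String tagConsuming(String s) {
        return CONSUMING.matcher(s).replaceAll("$1<span class=\"majusc\">$2</span>$3");
    }

    static String tagZeroWidth(String s) {
        return ZERO_WIDTH.matcher(s).replaceAll("<span class=\"majusc\">$1</span>");
    }

    public static void main(String[] args) {
        // The space after "XML" is consumed, so "XSLT" stays untagged here.
        System.out.println(tagConsuming("XML XSLT"));
        // Zero-width assertions consume nothing, so both acronyms are tagged.
        System.out.println(tagZeroWidth("XML XSLT"));
    }
}
```

The same consuming behaviour would apply to the first template: two acronyms separated by a single delimiter cannot both match, which is a further argument for zero-width constructs, whether \b or lookarounds.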
Please note that \b should not only match the \w/\W boundary, but also the beginning or end of the string (or line, when the 'm' flag is in force). Speaking of the 'm' flag, and in Michael's direction: I regard \b as much more useful than the 'm' flag when processing XML.
Gerrit
On 04.03.2010 06:59, Liam R E Quin wrote:
On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
On the subject of \b I'll note we do have \W and \w
So we do, I overlooked that. And we define it a little differently from Perl:
[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
So for example "+" is regarded as part of a word, while "-" isn't. Which strikes me as totally useless, to be honest.
I agree.
We could fix that for XPath 2.1 I think. I'm not sure what the most useful fix would be, I admit.
The Perl definition of "alphanumeric" plus "_" would probably work for \w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N}, and is coincidentally closer to what you get in Perl if you do use locale; and your locale is (say) en_UK.UTF8, as it's then the same as the POSIX fragment [[:alpha:][:digit:]_]
There are lots of things that could be added to regular expressions; but \b is hard to emulate, useful, and also we seem to have a rather odd \w. If \w is there, I think \b was omitted by mistake.
Or that \w was included by mistake!
Liam
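The two candidate definitions of \w discussed above can be compared character by character. The sketch below uses java.util.regex as a convenient test bed; the class WordClassComparison is a hypothetical name, and writing the subtraction [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] as the negated class [^\p{P}\p{Z}\p{C}] is an assumed equivalent formulation for Java.

```java
import java.util.regex.Pattern;

// Illustrative check of which characters count as "word" characters under each definition.
class WordClassComparison {

    // The definition quoted above: everything except punctuation, separators and "other".
    static final Pattern XSD_WORD = Pattern.compile("[^\\p{P}\\p{Z}\\p{C}]");

    // The suggestion above: letters, numbers and the underscore.
    static final Pattern PROPOSED_WORD = Pattern.compile("[\\p{L}\\p{N}_]");

    static boolean isXsdWordChar(String ch) {
        return XSD_WORD.matcher(ch).matches();
    }

    static boolean isProposedWordChar(String ch) {
        return PROPOSED_WORD.matcher(ch).matches();
    }

    public static void main(String[] args) {
        for (String ch : new String[] {"a", "7", "+", "-", "_"}) {
            System.out.printf("%s xsd=%b proposed=%b%n",
                    ch, isXsdWordChar(ch), isProposedWordChar(ch));
        }
    }
}
```

"+" is a math symbol (category Sm) and therefore a word character under the current definition, while "-" (Pd) and "_" (Pc) are punctuation and are excluded; the proposed definition rejects "+" and "-" and accepts "_".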
--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt, Dr. Reinhard Vöckler