
Re: [xsl] analyze-string help?


Subject: Re: [xsl] analyze-string help?
From: Graydon <graydon@xxxxxxxxx>
Date: Sun, 10 Jun 2012 12:20:40 -0400

On Sun, Jun 10, 2012 at 12:04:58PM -0400, Syd Bauman scripsit:
> > I think maybe it worked because I had it at the end of the pattern
> > and then later added additional characters. So I think I went from
> > [A-Za-z0-9 -] to this [A-Za-z0-9 -,./]
> 
> It was accidental? And here I thought it was a clever way to catch
> gnarly characters. The hyphen in the 2nd regexp means "from space
> (U+0020) to comma (U+002C)", i.e. expresses a range that matches the
> same characters [ !"#$%&'()*+,] matches. Many of these characters are
> a pain to type into an XSLT regexp, and thus a range like this seemed
> like a nice way to catch them.

Well, except that it's both subtle and clever, those banes of
maintainability.

One of the things I am very glad went into XSLT regular expressions is
the set of Unicode character categories; if you want punctuation, for
example, it's "\p{P}", so I might write the provided atom definition
as:

[\p{L}\p{Nd}\p{P}]

("Unicode character category letters", "Unicode character category
numbers, subcategory decimal digits", "Unicode character category
punctuation".)
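
To tie this back to xsl:analyze-string, here is a minimal sketch of how
that class might be used. The template name "main" and the sample string
are made up for illustration. Two things are worth knowing. First, the
regex attribute is an attribute value template, so the curly braces have
to be doubled there. Second, the category-based class is not
character-for-character the same as the range in the quoted pattern:
space (Zs), "$" (Sc) and "+" (Sm) are not punctuation and so fall
outside it, while the literal hyphen (Pd) now falls inside it.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical entry point and sample string, for illustration only. -->
  <xsl:template name="main">
    <xsl:variable name="sample" select="'Smith, J. 2012 $5'"/>

    <!-- regex is an attribute value template, so { and } are doubled. -->
    <xsl:analyze-string select="$sample" regex="[\p{{L}}\p{{Nd}}\p{{P}}]+">
      <xsl:matching-substring>
        <xsl:text>atom: </xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Spaces and '$' land here: neither is in L, Nd or P. -->
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With an XSLT 2.0 processor that accepts an initial named template, the
matching substrings should come out as "Smith,", "J.", "2012" and "5".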

Upper-case P means "everything not", so you can neatly express things
like "\P{Pd}", "any character that is not some kind of dash".
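
A small sketch of the difference between the two, using replace(). The
template name and the sample string are assumptions for illustration;
the sample contains a hyphen-minus (U+002D) and an en dash (U+2013),
both of which are in category Pd.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical entry point and sample string, for illustration only. -->
  <xsl:template name="main">
    <xsl:variable name="sample" select="'a-b&#x2013;c'"/>

    <!-- select holds an XPath expression, not an attribute value
         template, so the braces are written singly here. -->

    <!-- Delete everything that is NOT a dash: only the dashes remain. -->
    <xsl:value-of select="replace($sample, '\P{Pd}', '')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Delete everything that IS a dash: only the letters remain. -->
    <xsl:value-of select="replace($sample, '\p{Pd}', '')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```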

In my ideal world the syntax would evolve so you could constrain the
categories, for example "\p{Pd except '-'}" for "any character that is
some kind of dash except for U+002D hyphen-minus", since that would
make this even more useful for functions that take regular expressions,
such as tokenize().
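
For what it's worth, the XML Schema regular expression grammar, on which
the XPath 2.0 regex syntax is based, defines character class
subtraction, which looks like it covers this case:
"[\p{Pd}-[\-]]" reads as "any dash, minus the hyphen-minus". The sketch
below assumes a processor that implements subtraction; the template name
and sample string are made up for illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical entry point and sample string, for illustration only. -->
  <xsl:template name="main">
    <!-- The sample contains a hyphen-minus and an em dash (U+2014). -->
    <xsl:variable name="sample" select="'well-known&#x2014;mostly'"/>

    <!-- Split on any Pd character except U+002D, so the hyphenated
         word stays in one piece. -->
    <xsl:value-of select="tokenize($sample, '[\p{Pd}-[\-]]')"
                  separator="|"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The intent is that the em dash acts as a separator while "well-known"
survives as a single token.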

-- Graydon

-- 
Graydon Saunders        XML tools and processes for information delivery.
graydon@xxxxxxxxx

