[xsl] OT: request help with string-based/ RegEx problem


Subject: [xsl] OT: request help with string-based/ RegEx problem
From: "Jakob" <jakob@xxxxxxxx>
Date: Thu, 1 Apr 2004 12:04:18 +0200 (CEST)

Hello everybody,

I have the following problem:

I need to find all one- and two-character words in my
document, such as "L", "GG", "Bz", "mm", and also character
references standing for a single character, such as "&#8711;"
(nabla, an up-turned delta).  As any such combination is
possible, enumerating them would make a very long list.  Once
found, I'd like to surround each of these strings with an
element, like this:
<sym>L</sym>, <sym>GG</sym>, <sym>&#8711;</sym> ...

Furthermore, I am not interested in these character
sequences when they occur inside certain elements.  For
example, in <xref refid="abc">Part B</xref> I do not want to
tag the "B".  There is a limited number of such exclusions.

My understanding is that handling this in XSLT (1.0, at
least) is not possible.  I cannot currently switch to 2.0,
so I thought the best way would be to use regular
expressions (run as an Ant task) to accomplish the same
goal.

While I have no trouble writing a regex that finds all
one- or two-character words, I have not yet found a way to
express the contextual constraints.

The following is a "pseudo regex" expressing this idea:

------8<------
not following <xref[^>]+> or <syd1[^>]*> ...
  (.*)
  <                              ==> word start
  ([a-zA-Z] | [a-zA-Z][a-zA-Z])  ==> target
  >                              ==> word end
  (.*)
not before </xref> or </syd1> ...
------8<------
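
For illustration, here is a minimal sketch of one way the same idea
might be expressed outside a single regex, assuming Python is
available and the input is well formed.  Instead of encoding "not
inside <xref>" in the pattern, it splits the document into tags and
text runs and keeps a depth counter for the excluded elements.  The
function name tag_symbols and the EXCLUDED set are hypothetical, and
comments, CDATA sections and attribute values containing ">" are not
handled.

```python
import re

# Assumption: the exclusion list is taken from the examples in this post.
EXCLUDED = {"xref", "syd1"}

# A token is either a complete tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME = re.compile(r"</?([A-Za-z_][\w.-]*)")

# A "character" is an ASCII letter or a numeric character reference.
CHAR = r"(?:[A-Za-z]|&#(?:[0-9]+|x[0-9A-Fa-f]+);)"
# One or two such characters that are not part of a longer word or reference.
WORD = re.compile(r"(?<![\w&#;])(" + CHAR + r"{1,2})(?![\w&])")


def tag_symbols(xml_text):
    """Wrap one- and two-character words in <sym>, skipping excluded elements."""
    out = []
    depth = 0  # how many excluded elements are currently open
    for token in TOKEN.findall(xml_text):
        if token.startswith("<"):
            m = TAG_NAME.match(token)
            # Self-closing tags open and close at once, so they leave depth alone.
            if m and m.group(1) in EXCLUDED and not token.endswith("/>"):
                depth += -1 if token.startswith("</") else 1
            out.append(token)
        elif depth > 0:
            out.append(token)  # text inside an excluded element stays as it is
        else:
            out.append(WORD.sub(r"<sym>\1</sym>", token))
    return "".join(out)


if __name__ == "__main__":
    sample = '<p>Let L be (see <xref refid="abc">Part B</xref>) in mm.</p>'
    print(tag_symbols(sample))
```

Note that ordinary short words such as "be" or "in" are tagged as
well, since they also count as two-character words; a stop list would
have to be added if that is not wanted.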

Again, I am aware that this may be regarded as off-topic.
But if there is an XSL-based solution, or a different
approach altogether, I am happy to learn about it.

Thanks in advance.

Cheers,
Jakob.

