[XSL-LIST Mailing List Archive Home] [By Thread] [By Date]

RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)


Subject: RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: "Marc Portier" <mpo@xxxxxxxxxxxxxxxx>
Date: Mon, 14 Jan 2002 02:41:02 +0100

Hi Jeni,
(sorry for time passing, mailtrubbels, familyweekend, regular payed work)

and in every case you quys either think or type to fast (and it's more
likely to be both :-))

> > thinking about doing it... the reality of the regex engine is that
> > while you would like the tree to appear somewhere in memory to
> > select on, the way the regex engine will offer you the matchresult
> > will be more in an array like format.
>
> Yes, but when you talk about "the regexp engine" you're talking about
> a specific implementation with a specific API. I don't see any reason
> why a different implementation with a different API shouldn't offer
> you a different view.

YEZZ! a total agree here that is...
other hand, I kinda feel challenged to play the devils advocate in at least
pushing us in considering more viewpoints...

so let me at least try to build up some argumentation that is not based on
an implementation view of things (although it's my 2nd nature)...
This devils counterpart of your argumentation 'current implemenentation
should not burden the conception off', would be some 'conception shouldn't
get carried away purely based on notational ideas'

How did we get here? ...my bit of history-writing ;-)
- we introduced named regex'es as a shorthand for easier assembly of the
regex="" attribute of 'matcher'
(and as a nice side effect this could make available predefined and built-in
(or user defined libraries of) named-regexes that are cumbersome to write
down: fancynumber, html tags, davids-tech stuff, dateformats,...)
- for clarity of written selector-indexig accross the 'hidden' grouping
parenthesis (of the used named regex) we started looking for a more nested
notation that would extend the simple $1,2,3,... digit selector
- looking back at xpath expressions we (well, it was _your_ great idea)
found a 'most natural' notation for such extended selectors for use in xslt
environments
- and then, when clarifying/visualizing the rationale behind this xpath
notation and mapping it to some tree-representation the logical idea seemed
to be able to actually have this whole costruct really available to the
remainder of the xslt as a nodeset...
- we end up forgetting why we had matcher elements? (or as I wrote it: not
need them any more)

however, taking the time to reflect some...
to me there seems to be 2 aspects to the thing
- applying regex (with possibly nested notations \e(name) or :name:) to an
inputString, thus recognizing mathced group-regions on the string, and make
those available through some indexing (which is to be read as not only
sequential indexing but possibly some hierarchical notation related to the
nested regex notations)
- filling an output-template of to be constructed nodes (elements,
attributes, text, their names and values...) with the index-able
resultgroups we pick out of the regex-applying step

originally we talked only about nesting matchers... and as pointed out their
nesting was felt as being the clever way to have the table-returning
classical regexes return the much appreciated tree structures...
by introducing the nested regexs they by themselves get this power of
offering result-trees (rather then tables) pretty much like indeed (see
other replies in this ever running thread) yacc and the like build up
syntax-trees out of a fromal gramar notation that could be seen as nested
regex's (to some extend)


from a pure OOP/CBD feeling (meaning my brain fails me here, so I know it's
not scientific and thus doesn't classify as an argument) it *feels* right to
have 2 separate constructs for the 2 separate aspects.

having this as a basis observation, let me try to take up your challenge
(together with the echo of my devils-advocate-disclaimer: I don't even know
if I believe it myself :-)

>
> And I'd argue that we're going to need regular expression engines
> specific to XSLT anyway (for support in XSLT, I know that your

Jeni, with due respect (and I _did_ need to read your approach 3 times to
fully grasp the pure genious in it) people could argue in the end that what
you describe *is* a specific regex engine ;-)

> concerns aren't necessarily the same) - regular expression engines for
never mind our regexslt concerns, we have a more and more working thing in a
far more limited domain... hoping to be able to ship something soon
(probably a zip first, later on sourceforge and the like) to others dealing
with the same uptranslation use-case.
our participation in this discussion is helping us understanding more issues
we maybe didn't think of (you _are_ a great help) in return we hope to maybe
have helped this thought-train to get on the tracks

> XML Schema regular expressions aren't going to cover everything we
which isn't a lot more then extra constraints for validation, or is it?

> need for XSLT (e.g. start-string and end-string meta characters), and
> other regexp engines won't support the XML Schema stuff.
>
> But even having said that, perhaps you can get around it with some
> jiggery-pokery. Say you had $number defined as the regular expression:
>
>   ([0-9]+)(\.([0-9]+))?
>
> and you had a regular expression of:
>
>   [+-]?($number)([Ee]([+-][0-9]+))?
>
> Matching against:
>
>   -12.34E+5
>
> If you simply substitute in the $number expression to get a result
> expression:
>
>   [+-]?(([0-9]+)(\.([0-9]+))?)([Ee]([+-][0-9]+))?

that's pretty much what I had in mind from the start, and use offset kind of
tricks to map from original indices to nested indices... (or rather, vice
versa)

>
> Then run the match on that. The result of the match could come in
> pairs of start/end integers, giving the position of the start and end
> of the match as a whole and each subexpression, something akin to
> (assuming counting starts from 0):
>
>   ( (0, 9), (1, 6), (1, 3), (4, 6), (6, 9), (7, 9), (8, 9) )
>
> You can tell whether one subexpression is 'inside' another
> subexpression by the fact that its start position is equal to or
> greater than the start position of the parent position, and its end
> position is equal to or less than the start position of the parent
> position. So from that you should be able to construct the tree:
>
>   <rxp:match>
>     -
>     <rxp:match>
>       <rxp:match>12</rxp:match>.<rxp:match>34</rxp:match>
>     </rxp:match>
>     <rxp:match>E<rxp:match>+5</rxp:match></rxp:match>
>   </rxp:match>
>
yep, it's a bit of a hastle, but looks do-able

> You can work out which of these rxp:match elements should actually be
> a 'number' element by counting the number of open-brace characters
> before the variable reference in the regular expression. In the case
> of:
>
>   [+-]?($number)([Ee]([+-][0-9]+))?
>
> it's the first one, so the first rxp:match element (inside the global
> match) should be a number element instead:
>
>   <rxp:match>
>     -
>     <number>
>       <rxp:match>12</rxp:match>.<rxp:match>34</rxp:match>
>     </number>
>     <rxp:match>E<rxp:match>+5</rxp:match></rxp:match>
>   </rxp:match>
>
yep,
I'll be needing a fourth and fifth read before I actually get to
implementing it ;-), thanks for this it will get me working at some point as
named regexes is surely something that could augment our thing aswell...
altough I'm still doubting on the full tree build up (meaning: making stuff
accessible that wasn't in a parenthesis group)

On the approach itself and as stated above: while provig you *can* do it on
top of current regex engine style of working, I think there is room enough
for the hardcore regexengine people to find proof for doing this  better
directly based on the internals of their matching logic (and thus offer some
new style API... think it would become their logical todo after accepting
your \e()idea, no?

>
> I'll note that the *big* drawback with this algorithm is that it
> doesn't take into account nested subexpressions. Say you had something
> like:
>
>   $para   =>  \\para\{($phrase)\}
>   $phrase =>  (($bold)|($italic)|([^\\]*))*
>   $bold   =>  \\bold\{($phrase)\}
>   $italic =>  \\italic\{($phrase)\}
>
> to match a string like:
>
>   \para{\italic{this} is \bold{bold \italic{and italic}} text.}
>

need to check this, but we should be able to get round this in the matcher
approach by not callling the nested matcher when there is no more input to
match against...

> This is essentially David's problem. You can't just expand all the
> variable subexpressions because they'd go on for ever. But without
> expanding all the variable subexpressions, you can't successfully
> backtrack if you suddenly find that a variable subexpression isn't in
> fact matched. For example, if you had:

yep, only the regex of the nesting matcher should need some general version
of the to be matched text in its regex...
<matcher name="inner" regex="pre(.*)suf">
	<element name="match">
		<call-matcher name="inner" select-group="1" />
	</element>
</matcher>

the nesting should stop by itself at runtime cause the matcher runs out of
input-string
and the matcher element should be read like some for-each-match thing


>
>   $baz     =>  \\baz\{(($barfoo)|($barbar))\}
>   $barbar  =>  \\bar\{($bar)\}
>   $barfoo  =>  \\bar\{($foo)\}
>   $foo     =>  foo
>
> And a string like:
>
>   \baz{\bar{\bar{foo}}
>
> We want to get:
>
>   <baz>
>     \baz{
>     <barbar>\bar{<barfoo>\bar{<foo>foo</foo>}</barfoo>}</barbar>
>     }
>   </baz>
>
> We can start off by matching the string against $baz and working out
> that to match it, then the regular expression:
>
>   (($barfoo)|($barbar))\}
>
> must match the string:
>
>   \bar{\bar{foo}}}
>
> and that means that the following regular expression must match the
> string:
>
>   ((\\bar\{($foo)\})|(\\bar\{($bar)\}))\}
>
> but there's no way to tell whether the \bar at the beginning of the
> string should match the \\bar at the beginning of $barfoo or of
> $barbar.
>
> That might be an acceptable limitation, or it might be something that
> can be worked around, I'm not sure. I need to think about it some
> more.
>
> > the feeling I currently have is that this nested-regex to tree
> > vision directly has the power of circumventing the need for the
> > nested matcher structure alltogether... meaning with this kind of
> > tree view on the regex-match-result one can hand it over to the
> > regular-and-known xsl-node-juggling, I guess.
>
> Quite - that's what I think makes it so powerful. If you get a tree
> like the above, you could apply templates to it and so on. You have
> one big regular expression that gets directly translated into a tree,
> which you can then process using XSLT as usual.
>
yep... only my feeling is that you then do more work then is expected I
guess.
the nested matcher approach lets the user pick out what he/she needs

> >> I should note that nothing existing in XPath or XSLT automatically
> >> creates a tree in this way.
> >
> > And this would be an argument to keep the nested-matcher approach.
> > They behave more like the known templates that do create trees I
> > guess?
>
> Hey, why limit yourself to only doing things in the way that they're
> done now?!? :)
>
as before: a sure thing

> But I think I'm still not very clear on your nested matcher approach.
> Do you think you could use the examples above to illustrate how your
> nested matcher would work?
>

hope to have time tomorow to actually try regexslt out on the \para \bold
\italic example
haven't tried out anything like that up to know...

> Cheers,
>
> Jeni
>
> ---
> Jeni Tennison
> http://www.jenitennison.com/
>


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list



Current Thread
Keywords