[XSL-LIST Mailing List Archive Home] [By Thread] [By Date]

[xsl] Re: Re: Regular expression functions (Was: Re: comments on December F&O draft)


Subject: [xsl] Re: Re: Regular expression functions (Was: Re: comments on December F&O draft)
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Sat, 12 Jan 2002 23:50:29 -0800 (PST)

> > YACC is a parser generator for LALR(1) grammars (a subset of
> > context-free grammars), which describe languages much more powerful
> > than what regular expressions can describe.
> 
> Yep. For David's example, I think that's what we need.
> 
> (By the way, obviously I'm not a computer scientist and don't grasp
> all the theory about this [or indeed most things]. I'd appreciate it
> if you could explain terminology and acronyms like "LALR(1)" and
> "context-free grammars" - all the reading that I did seemed to assume
> you knew what they meant before hand :( Also, I might be using
> terminology wrongly, such that you misunderstand what I'm trying to
> say.)

Jeni,

more than 10 years ago I was teaching this to students... It is a serious course,
that is better not summarised in one post.

Long ago, I also did some work on implementing an algorithm in C (not inventing it)
for parsing natural language. I had to modify YACC to generate parsing tables with
support for ambiguities. The parsing table (actually they are two, but let's not go
into details) that YACC generates is a two-dimentional array of integers, where
generally the value of an element in this table specifies the new state of the
parser, to which it should go. The row number is the current state, the column
number is the code of another state or the code of a terminal symbol that's coming
from the input.

CF grammars are deterministic, they do not have ambiguities. This in practice means,
that each element in the parsing table is just a single integer -- from any state
and on any input there's just one new state to go into. There's just a single
(canonically ordered) parse tree for any sentence in a CF language.

On the other hand, natural language is ambiguous and a sentence often can have many
different parse trees (and meanings). This in practice means, that each element in
the parsing table for such language can potentially specify a list of new states.

In case you're interested in this, there are some CS courses available online, e.g.:

http://www.cs.umbc.edu/~squire/cs451_lect.html

And there are some really nice books:

"Compilers, Principles, Techniques, and Tools", by Alfred V. Aho,Ravi Sethi,Jeffrey
D. Ullman, ISBN: 0201100886

An online version of an old article in the Bell Journal describing in detail how
YACC works can be found at:
http://dinosaur.compilertools.net/yacc/index.html

All this stuff is really very interesting.



> >> Something along the lines of:
> >> 
> >>   $row      => ($frac)|($sqrt)|($expr)
> >>   $frac     =>  \\frac\{($row)\}\{($row)\}
> >>   $sqrt     =>  \\sqrt\{($row)\}
> >>   $expr     => ($times)|(($operand)?(\s*($operator)\s*($operand))+)
> >>   $times    => ($operand){2,}
> >>   $operand  => ($row)|($sup)|($number)|($ident)
> >>   $sup      => ($operand)\^($operand)
> >>   $operator => \\pm|\-
> >>   $number   => -?[0-9]+(\.[0-9]+)?
> >>   $ident    => [a-z][a-z0-9]*
> >
> > This is not a regular grammar but a context-free grammar (although
> > it is not formally defined and I have to guess which exactly is the
> > set of terminal and non-terminal symbols, as well as which is the
> > starting symbol).
> 
> The regular expression syntax that I was aiming for here is an
> extension of the regular expression syntax from XML Schema. The only
> extension in evidence here is that it allows you to insert named
> subexpressions. For that, I was using the syntax that I introduced
> earlier in this thread: "(" "$" subexpression-name ")" (so ($times) is
> a reference to a named subexpression (or non-terminal in other words)
> that's bound to the name 'times').
> 
> Perhaps I shouldn't be calling these "regular expressions", but they
> seemed roughly equivalent to Lex regular expressions from what I read
> at http://dinosaur.compilertools.net/lex/index.html (though there they
> use the syntax "{" non-terminal "}" to refer to a non-terminal).

Just try to feed the above to any lexical analyser and see what happens... :o)

> I was aiming for something where the terminals are regular expressions
> that don't contain references (so $operator, $number and $ident).
> 
> > It seems that $row, $frac, $sqrt, $expr and $operand are the
> > non-terminal symbols. In this case this is a context-free grammar
> > that does not describe a regular language (a language that can be
> > described/generated by a set of regular expressions). This is so,
> > because in a set of rules describing a regular language the rules
> > can only have terminal symbols and then one variable. The rule for
> > $expr is in clear violation of this principle.
> 
> Sorry, I'm having difficulty keeping up. Do you mean that you cannot
> articulate something like, for example:
> 
>   AndExpr  :=  Expr "and" Expr

In case Expr is a non-terminal, this is not a regular expression. You're going in
the realm of defining and parsing context free (CF) languages.

> if you want to describe a regular language, because it contains more
> than one variable (which I take to mean a reference to a
> non-terminal)?  (Again, finding a readable description of a "regular
> language" is hard to do.)

Just use the CS course pointed by the link above, but be patient -- this is not an
easy reading.

> >> I am fairly convinced that it's possible to create an
> >> implementation to do this, based on the fact that there are plenty
> >> of lexers out there that do roughly the same thing.
> >
> > As shown above, what would be needed is not a "lexer" but a parser
> > for CF grammars.
> >
> > In case a parser generator is used, it has to be fed with the rules
> > of the grammar. It then generates a table that guides the parsing
> > process. This table typically shows the action to be taken and the
> > next state of the parser, if it is in a given state, the top of the
> > stack contains a given symbol and the input contains a given
> > terminal symbol. This table-driven parser will need a lexical
> > analizer to call when each next non-terminal from the input is
> > needed.
> >
> > You're suggesting this to be done dynamically at transformation time.
> 
> Or that the table is built up during compilation of the stylesheet
> from static declarations in the stylesheet.

At present the predominant majority of XSLT processors are interpreters, not
compilers. This means that such table generation and compilation will have to be
repeatedly performed (because their results are not persisted) every time the
transformation is applied. Seems too much a cost...

I'd rather have a function library, including a parser-generator function that
produces from a given set of syntax rules the corresponding parsing tables (as XML
that will be persisted just once and used many times!) and parser function that will
use as parameters the necessary parsing tables, a reference to a lexical analyzer
function, and the input string to feed to the lex. analiser.

Cheers,
Dimitre Novatchev.

__________________________________________________
Do You Yahoo!?
Send FREE video emails in Yahoo! Mail!
http://promo.yahoo.com/videomail/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list



Current Thread
Keywords