
Subject: Re: [xsl] library for parsing RTF
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Sun, 27 Jun 2010 14:54:59 -0700

> Dimitre Novatchev seems to be the expert on writing parsers in XSLT. Perhaps
> his next project could be a parser-generator (aka compiler-compiler) - a
> program that takes a BNF description of the grammar you want to parse, and
> generates an XSLT stylesheet/library to do the parsing.

At the time I considered this and decided against it. Too much effort
for an application that would be used very rarely -- at design time
only. Once this tool creates the required parsing tables for the
generic LR1 parser, it is never used at runtime. Suitable
compiler-compiler systems already exist that can be used.

I actually modified Berkeley YACC. The modification works exactly as
the original system with a single addition. The original functionality
is to accept a BNF description of the grammar you want to parse, and
to generate parsing tables for input at run-time to a general
table-driven LR1 parser. With my addition it now has an option to
output these parsing tables in XML format, so that the XSLT
implementation of the generic parser can use them as input.
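
For readers who have not seen one, here is a rough Python sketch of what
a general table-driven LR parser does with such tables. It is not the
FXSL code and not the YACCX table format: the toy grammar (sums of
numbers), the dictionary layout, and the names lr_parse, ACTION, GOTO,
RULES and evaluate are all invented for illustration. The point is only
that the driver loop is generic, and everything grammar-specific lives
in the tables and in the callback invoked on each reduction.

```python
# Toy grammar (an assumption for this sketch):
#   rule 1:  E -> E '+' n
#   rule 2:  E -> n
# Hand-built LR tables for it; '$' marks end of input.
ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept",),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1}
RULES = {1: ("E", 3), 2: ("E", 1)}  # rule number -> (left side, right-side length)


def lr_parse(action, goto, rules, tokens, on_reduce):
    """Generic driver: knows nothing about the grammar beyond the tables."""
    states = [0]   # stack of parser states
    values = []    # stack of semantic values, parallel to the symbols seen
    stream = iter(list(tokens) + [("$", None)])
    kind, text = next(stream)
    while True:
        act = action.get((states[-1], kind))
        if act is None:
            raise SyntaxError(f"unexpected {kind!r} in state {states[-1]}")
        if act[0] == "shift":
            states.append(act[1])
            values.append(text)
            kind, text = next(stream)
        elif act[0] == "reduce":
            lhs, size = rules[act[1]]
            args = values[len(values) - size:]
            del states[len(states) - size:]
            del values[len(values) - size:]
            values.append(on_reduce(act[1], args))   # user callback
            states.append(goto[(states[-1], lhs)])
        else:  # accept
            return values[-1]


def evaluate(rule, args):
    """Reduction callback for the toy grammar: computes the sum."""
    if rule == 2:                      # E -> n
        return int(args[0])
    return args[0] + int(args[2])      # E -> E '+' n


tokens = [("n", "1"), ("+", "+"), ("n", "2"), ("+", "+"), ("n", "3")]
result = lr_parse(ACTION, GOTO, RULES, tokens, evaluate)
```

Swapping in a different grammar means swapping ACTION, GOTO, RULES and
the callback; lr_parse itself stays untouched.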

The tool is called YACCX and has been available for download for a few
years from the FXSL CVS.

If anyone is interested in seeing how the parsing tables generated by
YACCX look in XML format, here is a link to the parsing tables for
JSON:

    http://fxsl.cvs.sourceforge.net/viewvc/fxsl/fxsl-xslt2/data/parseTables-Jason.xml?revision=1.1&view=markup

The parser for JSON is here:

    http://fxsl.cvs.sourceforge.net/viewvc/fxsl/fxsl-xslt2/f/func-json-document.xsl?revision=1.11&view=markup

Of notable interest is how the generic parser (f:lrParse() ) is used:

    <xsl:variable name="vparseResult">
      <xsl:sequence select=
        "f:lrParse($vJasonPPTables,
                   $pstrJson,
                   f:lexer-JSON(),
                   f:OnJSONRuleReduced()
                  )
                  /computedValue/node()
        "
      />
    </xsl:variable>

and also the RegEx used by the lexical analyzer -- look at $vRegExJSON,
defined in lines 236-256:

236  <xsl:variable name="vRegExJSON" as="xs:string">
237    ([\s]*)          <!-- Skip leading whitespace -->
238                     <!-- Followed by: -->
239    (
240      ("[^"\\]*( ( ((\\[\\/bfnrt"]) | (\\u([0-9A-Fa-f]{4})) )[^"\\]*)*")) <!-- A string -->
241     |                           <!-- Or a Number -->
242       ((([-]?[0-9]+)?\.)?[-]?[0-9]+([eE][-+]?[0-9]+)? )
243     |
244       ((true)|(false)|(null)  <!-- Or true
245                                    or false or null -->
246          )
247
248        |
249          ([{},:\[\]])            <!-- Or one of these:
250                                    '{', '}', ',', ':',
251                                    '[', ']' -->
252
253       )            <!-- These are all our token types -->
254       (.*)$        <!-- Only get the first token,
255                         Skip the rest for the future -->
256  </xsl:variable>
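
The "match the first token, capture the rest" idea can be sketched in
Python as follows. This is only an illustration under stated
assumptions: the pattern is a simplified one written for this sketch
(it is not a port of $vRegExJSON), and the names TOKEN, next_token and
tokenize are hypothetical.

```python
import re

# One pattern: optional whitespace, exactly one token, then "the rest".
TOKEN = re.compile(r"""
    \s*                                   # skip leading whitespace
    (?:
        (?P<string>"(?:[^"\\]|\\["\\/bfnrt]|\\u[0-9A-Fa-f]{4})*")
      | (?P<number>-?[0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)
      | (?P<keyword>true|false|null)
      | (?P<punct>[{}\[\],:])
    )
    (?P<rest>.*)$                         # everything after the first token
""", re.VERBOSE | re.DOTALL)

KINDS = ("string", "number", "keyword", "punct")


def next_token(text):
    """Return (kind, token, rest) for the first token, or None at end of input."""
    if not text.strip():
        return None
    m = TOKEN.match(text)
    if m is None:
        raise ValueError(f"no JSON token at: {text[:20]!r}")
    for kind in KINDS:
        if m.group(kind) is not None:
            return kind, m.group(kind), m.group("rest")


def tokenize(text):
    """Call next_token repeatedly, feeding each 'rest' back in."""
    tokens = []
    while (step := next_token(text)) is not None:
        kind, token, text = step
        tokens.append((kind, token))
    return tokens


sample = tokenize('{"a": [1, -2.5e3, true]}')
```

Each call rescans the remaining input, so a parser that pulls tokens
one at a time only ever needs the current "rest" string as its lexer
state.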


The parsing tables and parser/lexical analyzer for XPath 2.0 are also
available for anyone interested. Be warned that they are much bigger
and way too complex.


--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play



On Sun, Jun 27, 2010 at 2:12 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>
>> there is no library and it is not required:
>> since RTF is a textual format, you can use XSLT 2.0 regexp capabilities to
>> parse RTF
>
> For a language as rich as RTF, regular expressions are not going to get you
> all that far: they are probably only suitable for writing the lexical analyzer
> (or tokenizer).
>
> Dimitre Novatchev seems to be the expert on writing parsers in XSLT. Perhaps
> his next project could be a parser-generator (aka compiler-compiler) - a
> program that takes a BNF description of the grammar you want to parse, and
> generates an XSLT stylesheet/library to do the parsing.
>
> Michael Kay
> Saxonica

