Subject: RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: "Steven Noels" <stevenn@xxxxxxxxxxxxxxxx>
Date: Wed, 9 Jan 2002 22:04:15 +0100

> -----Original Message-----
> From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx]On Behalf Of Michael Kay
> Sent: Wednesday, 9 January 2002 12:40
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: RE: Regular expression functions (Was: Re: [xsl] comments on
> December F&O draft)

> I'm interested in your exploration of the use-cases for
> regexp matching and
> possible XSLT constructs to support those use cases, though
> so far I've had
> difficulty following the "make-it-up-as-you-go-along" style of
> specification!
>
> Mike Kay

We are currently working on a little tool (packaged as a Cocoon
generator, an Ant task and a CLI app) that is more or less
Omnimark-like, i.e. it enables you to 'uptranslate' a non-XML document
(HTML, delimited ASCII, ...) to an XML document.

We baptised it Regexslt since it borrows (a little bit) from the XSLT
language design.

It is based on the Jakarta ORO regex library.

Given the input document (which can be a URL)
http://www.bloomberg.com/bbn/technology.html and this regexslt
specification:

<?xml version="1.0" encoding="UTF-8"?>
<regexslt xmlns="http://outerx.org/ns/regexslt/transform/1.0">
  <element name="feed">
    <element name="title">
      <text>Bloomberg &gt; Technology</text>
    </element>
    <element name="url">
      <text>http://www.bloomberg.com/bbn/technology.html</text>
    </element>
    <call-matcher name="feeddate"/>
    <call-matcher name="items"/>
  </element>
  <matcher
regex="CLASS=&quot;story3&quot;&gt;([^&lt;]+)&lt;BR&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;FONT\sCOLOR=&quot;#333333&quot;\sFACE=&quot;sans-serif,\sarial&quot;&gt;&lt;SPAN\sCLASS=&quot;story&quot;&gt;([^&lt;]+)&amp;nbsp;(.+)&lt;A\sHREF=&quot;([^&quot;]+)&quot;&gt;More"
name="items">
    <element name="item">
      <element name="blurb">
        <value-of select-group="1"/>
      </element>
      <element name="body">
        <value-of select-group="2"/>
      </element>
      <element name="url">
        <value-of select-group="4"/>
      </element>
    </element>
  </matcher>
  <matcher
regex="&lt;SPAN\sCLASS=&quot;date&quot;&gt;([^&lt;]+)&lt;/SPAN&gt;"
name="feeddate">
    <element name="date">
      <value-of select-group="1"/>
    </element>
  </matcher>
</regexslt>

it is transformed into

<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <title>Bloomberg &gt; Technology</title>
  <url>http://www.bloomberg.com/bbn/technology.html</url>
  <date>Wed, 09 Jan 2002, 3:48pm EST</date>
  <item>
    <blurb>Oracle, BEA, Software Stocks Surge After SAP Says 2001 Sales
Beat Forecast</blurb>
    <body>The shares of	Oracle Corp., BEA Systems Inc. and other
software companies surged	after SAP AG, the largest maker of
business-management programs,	said it surpassed a lowered 2001 sales
forecast.</body>
    <url>http://quote.bloomberg.com/fgcgi.cgi?ptitle=Technology%20News&amp;s1=blk&amp;tp=ad_topright_tech&amp;T=markets_bfgcgi_content99.ht&amp;s2=ad_right1_technology&amp;bt=ad_position1_technology&amp;middle=ad_frame2_technology&amp;s=APDyfihUCT3JhY2xl</url>
  </item>
[...]
</feed>

One thing that doesn't work well currently is specifying the regex as
an attribute of the <matcher> element: as the example above shows, the
markup characters in the pattern have to be escaped as entities, which
makes the regex hard to read. We will avoid this by allowing the regex
to go inside a CDATA section of a <regex> subelement (this will be
optional; we are testing it right now). I'm not sure whether this is
good practice, so advice is welcome. It is only partially related to
this discussion, of course.
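
As a rough sketch of what that could look like, the "feeddate" matcher
from the specification above might be written as follows. The exact
syntax here is an assumption, since the <regex> subelement is still
being tested; in particular, dropping the regex attribute in favour of
the subelement is only illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<regexslt xmlns="http://outerx.org/ns/regexslt/transform/1.0">
  <!-- Hypothetical form: the pattern moves from the regex attribute
       into a <regex> child, so no entity escaping is needed. -->
  <matcher name="feeddate">
    <regex><![CDATA[<SPAN\sCLASS="date">([^<]+)</SPAN>]]></regex>
    <element name="date">
      <value-of select-group="1"/>
    </element>
  </matcher>
</regexslt>
```

The pattern is now readable as a plain regex. The one thing a CDATA
section cannot contain is the sequence ]]>, so a pattern containing it
would still need special handling.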

We plan on releasing regexslt "when it's ready" (weeks, not months)
under a liberal license (ASF). People who are willing to play around
with it can contact me. There's also an XML Schema for the language (we
found validation of the transformation sheet very important).

But we would much more appreciate criticism and suggestions from the
people on this thread :-)

Pointers to other regex libraries which are more up to par with Perl
regexes would be welcome, too.

Regards,

Steven Noels
http://outerthought.org/
(+32)478 292900

