
RE: [xsl] Over 300 MB XML file and XSLT or XQuery

Subject: RE: [xsl] Over 300 MB XML file and XSLT or XQuery
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 13 Jan 2005 15:45:01 -0000

> To Michael Kay. Performance is not an issue. I am very
> new to XQuery. I would like to get my hands dirty with
> XQuery to learn a new trick of the trade but would
> like to follow a technically correct approach to solve
> this kind of problem.
> Let's assume I have solved the big XML file problem and
> now, given a text node, I need to search for this text
> in the tens of thousands of small XML or HTML files,
> generate stats like where it was found, how many times
> etc. and if not found generate meaningful logs. I can
> write Java classes if necessary.
> I would want to avoid converting small files into one
> large file. I was thinking about treating the collection
> of all small files as an XML database and using XQuery.

In Saxon, if you use the doc() or document() function, then the file will be
loaded into memory, and will stay in memory until the end of the run, just
in case it's referenced again. So you will hit the same memory problem with
lots of small files as with one large file - worse, in fact, since there is
a significant per-document overhead.

However, there's a workaround: an extension function
saxon:discard-document() that causes a document to be discarded from memory
by the garbage collector as soon as there are no more references to it. So
you should be able to do a serial search of a large collection of documents
something like this (let's assume $uris is a sequence of strings holding the
document URIs):


XQuery:

for $u in $uris
let $doc := saxon:discard-document(doc($u))
return
  if (my:condition($doc))
    then <match uri="{$u}"/>
    else <no-match uri="{$u}"/>

XSLT 2.0:

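<xsl:for-each select="for $u in $uris return saxon:discard-document(doc($u))">
  <xsl:choose>
    <xsl:when test="my:condition(.)"><match uri="{document-uri(.)}"/></xsl:when>
    <xsl:otherwise><no-match uri="{document-uri(.)}"/></xsl:otherwise>
  </xsl:choose>
</xsl:for-each>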

There's no real difference between the XSLT and XQuery solutions; it's just
a different surface syntax.
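
For the stats the original question asks about (where the text was found and
how many times), the same one-document-at-a-time pattern can be sketched in
plain Java. This is not Saxon: it uses the JDK's built-in DOM parser, and the
class and method names (SerialSearch, countOccurrences, countInDocument,
search) are made up for illustration. It assumes every file is well-formed
XML, so HTML that isn't well-formed would need a different parser, and it
counts matches in the concatenated text content, so a match may span element
boundaries.

```java
import java.io.InputStream;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class SerialSearch {

    /** Counts non-overlapping occurrences of needle in haystack. */
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = haystack.indexOf(needle);
        while (from >= 0) {
            count++;
            from = haystack.indexOf(needle, from + needle.length());
        }
        return count;
    }

    /**
     * Parses one document and counts needle in its text content.
     * The DOM tree is only referenced locally, so it becomes eligible
     * for garbage collection as soon as this method returns.
     */
    static int countInDocument(InputStream in, String needle) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        String text = doc.getDocumentElement().getTextContent();
        return countOccurrences(text, needle);
    }

    /**
     * Visits each URI in turn and records the hit count per document.
     * Only the counts are retained, never the parsed documents.
     */
    static Map<String, Integer> search(List<String> uris, String needle)
            throws Exception {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (String u : uris) {
            try (InputStream in = URI.create(u).toURL().openStream()) {
                hits.put(u, countInDocument(in, needle));
            }
        }
        return hits;
    }
}
```

An entry with a count of 0 plays the role of the no-match element above, so
the "not found" log can be produced from the same map.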

If the files are in a directory structure, then you should be able to read
the directory structure directly by calling the relevant Java methods from
your XSLT or XQuery code.
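
As a rough illustration of the directory idea, here is a minimal Java sketch
that walks a directory tree and returns the file URIs as strings, which is
the shape $uris is assumed to have above. The class name UriLister, the
method name listDocumentUris, and the choice of extensions (.xml, .html,
.htm) are assumptions for the example. How the result gets bound to $uris
depends on how you invoke Saxon and isn't shown here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class UriLister {

    /**
     * Returns the URIs of all regular files under root whose names end in
     * .xml, .html or .htm (case-insensitive), sorted as strings.
     */
    static List<String> listDocumentUris(Path root) throws IOException {
        // Files.walk holds directory handles open, so close the stream.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString()
                                   .toLowerCase(Locale.ROOT);
                    return name.endsWith(".xml")
                        || name.endsWith(".html")
                        || name.endsWith(".htm");
                })
                .map(p -> p.toUri().toString())
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

Only the URI strings are held in memory here; the documents themselves are
not opened until each URI is actually processed.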

See also:

Michael Kay
