Subject: RE: [xsl] use XSLT or XQuery in Saxon?
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 6 Jan 2005 09:47:06 -0000

> I have an extremely large (over 300 MB) XML file and tens
> of thousands of small XML files generated by
> applying various XSLT on the one big XML file.

You're right, 300Mb *is* large (I had someone recently ask how to process a
large file and it turned out to be 300Kb). You have a choice between
spending money on lots of memory (say 2Gb, but it depends on the actual
structure) and doing more development work to split the task up. This
applies equally whether you are using XSLT or XQuery - in Saxon these are
really just different surface syntaxes for the same processing engine.

> I am using Saxon for XSLT and will be using it also
> for XQuery.
> Is XQuery or XSLT the better solution for this problem?
> Query each text node in the big XML file and verify
> that its content is present in one of the result XML
> files.

Clearly this requires a better algorithm than searching all the small files
once for each text node in the large file.

One solution is to aggregate the small files into a single document and
index it using a key. This would require XSLT, because keys are not
available in XQuery. Some XQuery implementations might do an indexed join
automatically, but Saxon doesn't (yet). Of course, aggregating the small
files means even more memory.
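A minimal sketch of the key-based approach (XSLT 2.0; the file name collection.xml, the wrapper structure, and the value/@file markup are assumptions for illustration, not taken from the thread). It assumes the small files have been aggregated under one root, with each candidate value in a <value> element that records which file it came from:

```xml
<!-- Index the aggregated small-file values by their string content.
     All names here (small-values, value, @file, collection.xml) are
     illustrative assumptions. -->
<xsl:key name="small-values" match="value" use="string(.)"/>

<xsl:template match="text()" mode="report">
  <!-- Three-argument key() (XSLT 2.0) looks up the value in the
       aggregated document rather than the current one. -->
  <xsl:variable name="hits"
      select="key('small-values', string(.), doc('collection.xml'))"/>
  <xsl:choose>
    <xsl:when test="$hits">
      <found value="{.}" file="{$hits[1]/@file}"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- Report the parent element name to locate the value
           in the big file, as the report requires. -->
      <missing value="{.}" parent="{name(..)}"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```

The point of the key is that each lookup is (roughly) constant time, instead of a scan of every small file per text node.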

Another solution, again dependent on XSLT, is to use grouping. This doesn't
require the small documents to be aggregated into a single document. If you
take the union of the text nodes in the large document and the values in the
small documents, and then do grouping, a group of size 1 indicates a value
that is present in one file and not the other.
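A sketch of the grouping approach (XSLT 2.0; big.xml, the results directory, and the <value> element name are assumptions, and the collection() URI syntax shown is Saxon-specific):

```xml
<!-- Pool text nodes from the big file with values from the small
     files, group by string value, and flag values that occur only
     in the big file. Names are illustrative assumptions. -->
<xsl:variable name="big" select="doc('big.xml')"/>
<xsl:variable name="small" select="collection('results?select=*.xml')"/>

<xsl:template name="report-missing">
  <xsl:for-each-group select="$big//text() | $small//value"
                      group-by="string(.)">
    <!-- A singleton group whose member sits in the big document is
         a value with no match in any small file. -->
    <xsl:if test="count(current-group()) = 1
                  and current-group()[1]/root() is $big">
      <missing value="{current-group-key()}"
               parent="{name(current-group()[1]/..)}"/>
    </xsl:if>
  </xsl:for-each-group>
</xsl:template>
```

Note the root() test: without it, a value that happens to appear twice in the big file (or twice in the small files) could be mistaken for a match.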

However, if performance is really important (you don't actually say), I
think I would be inclined to write this "by hand" as a SAX application. It
will probably be an order of magnitude faster that way.

In the past it was taken for granted that to handle 300Mb of data you needed
a database. I wouldn't rule this option out: it largely depends on where the
data comes from and what its lifecycle looks like. Databases are designed
specifically for this kind of job.

Michael Kay

> Based on this information, generate a report
> that shows which content is present and in which file,
> and, in a separate section, which content was not found
> in the result XML files, along with its parent
> element or other markup to indicate its position in
> the big XML file.
> All the small XML files are stored as flat files in
> various directories on the Windows file system, although
> most files are in one directory. The big XML file is
> fairly complex, with multiple levels of nested
> elements.
> Any comments or suggestions?
> Thank you