RE: [xsl] optimization for very large, flat documents


Subject: RE: [xsl] optimization for very large, flat documents
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 19 Jan 2005 09:16:56 -0000

> I'm trying to process a very large (600 MB) flat XML document, a
> bibliography where each of the 400,000 entries is completely independent
> of the others.  According to the Saxon web site and mailing list, it'll
> take approx. 5-10 times that (3 GB) to hold the document tree in memory,
> which is impractical.  The Saxon mailing list also has some tips about
> how to accomplish this, but my question is: Why doesn't XSLT provide a
> way to specify that a matched node can be processed independently of its
> predecessor and successor siblings?  Alternatively, couldn't an XSLT
> processor infer that from the complete absence of XPath expressions that
> refer to predecessor and successor siblings?

I think the reasons that XSLT vendors have not tried this approach are:

(a) there are rather few stylesheets where the technique works, and can be
seen statically to work. It's not enough that all path expressions should
select downwards: there must be no absolute path expressions, no global
variables that select from the initial context node, no keys, and probably
quite a few other conditions besides.

(b) for such stylesheets, a completely different run-time approach is
needed: effectively, a different XSLT processor.

I think that in practice if you want to do serial transformation then a
functional language is not the right answer: if you can only look at each
piece of input data once, then you need the ability to remember what you
have seen, so you need a procedural language with updatable memory. That's
why STX was invented.
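
To make the "updatable memory" point concrete, here is a minimal sketch
of a single serial pass that remembers what it has seen. It is Python,
not STX, and the element names `entry` and `year` are assumptions made
up for the illustration, not taken from the poster's data.

```python
import io
import xml.etree.ElementTree as ET

def count_by_year(source, record_tag="entry"):
    """Single pass over a flat document, tallying records per <year>.

    `record_tag` and the <year> child are hypothetical names.
    """
    counts = {}  # the updatable memory: state carried across records
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the document element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            year = elem.findtext("year", default="unknown")
            counts[year] = counts.get(year, 0) + 1
            root.clear()  # discard finished records so memory stays bounded
    return counts

sample = io.BytesIO(
    b"<bibliography>"
    b"<entry><year>2004</year></entry>"
    b"<entry><year>2005</year></entry>"
    b"<entry><year>2004</year></entry>"
    b"</bibliography>"
)
year_counts = count_by_year(sample)
```

Each record is seen exactly once and then thrown away; only the tally
survives, which is the kind of state a purely functional stylesheet has
no natural place to keep.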

However, I think there is scope for someone to package up the idea of
running an XSLT transform on each "record" in a large file, and then
recombining the results.
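
A rough sketch of that packaging, assuming the records are direct
children of the document element. The function `transform_record` is a
hypothetical stand-in for invoking a real XSLT transform on one record,
and the names `entry`, `title`, `item` and `results` are invented for
the example.

```python
import io
import xml.etree.ElementTree as ET

def transform_record(entry):
    """Hypothetical per-record transform; a real one would run XSLT here."""
    out = ET.Element("item")
    out.text = entry.findtext("title", default="").upper()
    return out

def transform_records(source, sink, record_tag="entry"):
    """Stream `source`, transform each record alone, write results to `sink`."""
    sink.write("<results>")  # wrapper that recombines the per-record output
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the document element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            result = transform_record(elem)
            sink.write(ET.tostring(result, encoding="unicode"))
            count += 1
            root.clear()  # drop processed records; only one is held at a time
    sink.write("</results>")
    return count

src = io.BytesIO(
    b"<bibliography>"
    b"<entry><title>a</title></entry>"
    b"<entry><title>b</title></entry>"
    b"</bibliography>"
)
out = io.StringIO()
n = transform_records(src, out)
```

Because each record is transformed in isolation, this only suits
stylesheets that meet the conditions in (a) above: nothing in the
per-record transform may look outside the record it is given.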

Michael Kay
http://www.saxonica.com/ 

