[XSL-LIST Mailing List Archive Home]
[By Thread]
[By Date]
RE: Efficient Stylesheets for reordering
Subject: RE: Efficient Stylesheets for reordering From: Jose Alberto Fernandez <JFernandez@xxxxxxxxxxx> Date: Wed, 8 Nov 2000 13:54:33 -0800 |
> From: Jeni Tennison [mailto:mail@xxxxxxxxxxxxxxxx] > > Jose, > > > Am I completely wrong on my understanding of XSLT? > > Almost ;) The big gap seems to be that you think that the XSLT > processor is working on a stream of information that comes into it > from the XML, and is writing out a stream of information concurrently. > This is inaccurate. > > What actually happens is that the XSLT processor parses the XML > document and creates from it a tree representation of the document > (the Document Object Model or DOM). It stores this DOM in memory and > operates over it to create another DOM (the result). Then it > (usually) serialises this DOM into an output file. > Jeni, I understand the semantic model defined by XSLT, which is what you describe here. But as far as I understand, it is up to the implementations to process the tree in some order, top-down in parallel, bottom-up, using a DFS algorithm or whatever. It could be argued that streamming the transformation is nothing else than being able to use a pure DFS algorith for the transformation. Granted, this is not always possible. > One of the points about XSLT is that processors don't *have* to do all > the processing sequentially: there is nothing in the language that > prevents them from applying templates to s1 nodes, s2 nodes, s3 nodes > and so on, all concurrently, and then constructing the result from > that directly. > I think one of the problems of XSLT is this idea that anything can fit in memory or "big documents are very rare, and can be very slow". In my case we need to process very large documents which may grow in an order of magnitude by the transformation. So you can imagine that the memory-in/memory-out paradigm becomes a real problem. At the same time this is not a once a day batch file kind of operation but it must happened concurrently with other requests of the same type, which means memory utilization is even more important. > Some processors allow asynchronous and partial transformation. You > could have a look at MSXSL (XSLProcessor) and Saxon (saxon:preview) > which may be helpful. > I would like to have some discussion, at the semantic level, on identifying a subset of the XSLT language that can be process using a DFS approach. Can a useful subset of the language be identified. By this I mean that, unless indicated to be cached, once all template is applied to an element, such element node can be removed discarded. Can engines use such analysis to optimize memory utilization? Does XSLT provides enough information to be able to acomplish this. > In your case, where you're just copying nodes across from one tree > into a different structure, you can use xsl:copy-of rather than > xsl:apply-templates: > > <xsl:template match="doc1" > <doc2> > <xsl:copy-of select="s1[1]" /> > <xsl:copy-of select="s3" /> > <xsl:copy-of select="s4[1]" /> > <xsl:copy-of select="s2[1]" /> > <xsl:copy-of select="s5" /> </doc2> > </xsl:template> > Well, the example was just to showcase the issues I am refering to. In reality each of the "s?" nodes will be further transformed. One of the things to notice here is that expressions like select="s3" do not seem to be DFS friendly since you need to collect all "s3" nodes and hence need to wait for all the "doc1" node to be in memory before you can continue to process anything. In particular if there is no "s3" nodes you will not know even after finding "s4" because XSLT has no idea of the input DTD. Which I think is one of the more STUPID things on the spec. Now, assumming that we know the input DTD, can we write a better template set to obtain the expected result? <!ELEMENT doc1 (s1, s2, s3*, s4?, s5+) > Is there anything missing? <xsl:template match="doc1" > <doc2> <xsl:apply-templates/> </doc2> </xsl:template> <xsl:template match="s1" > <xsl:copy-of select="." /> </xsl:template> <xsl:template match="s2" > <!-- variable 's2var' must have scope 'doc1' somehow --> <xsl:variable name="s2var" select="." /> </xsl:template> <xsl:template match="s3" > <xsl:copy-of select="." /> </xsl:template> <xsl:template match="s4" > <xsl:copy-of select="." /> </xsl:template> <xsl:template match="s5[1]" > <xsl:copy-of select="$s2var" /> <xsl:copy-of select="." /> </xsl:template> <xsl:template match="s5[????]" > <!-- Some way to say position larger than 1 --> <xsl:copy-of select="$s2var" /> <xsl:copy-of select="." /> </xsl:template> I do not think it is possible to use s2var the way I am doing, but you can see that this set of templates can be easily process in an streamming fashion. The only memory needed is that to store the subtree "s2". Now, just think that the actual XML document being process is a 'doc' defined by: <!ELEMENT doc doc1* > and the file contains 10 million records. This is the kind of issues people would like to be able to solve using XSLT in a very efficient manner. If XSLT is the wrong tool, are there any other XML transformation languages, in the works, that may be better suited for this kind of work? Jose Alberto XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
RE: Efficient Stylesheets for reord, Kay Michael | Thread | Re[2]: Efficient Stylesheets for re, Jeni Tennison |
distinct filtering, GarethS | Date | RE: SAX mode, DOM mode and caching, Kevin Jones |
Month |