[XSL-LIST Mailing List Archive Home]
[By Thread]
[By Date]
[xsl] Efficient recursive grouping in XSLT 1.0?
Subject: [xsl] Efficient recursive grouping in XSLT 1.0? From: Ian Young <I.M.Young@xxxxxxxxxx> Date: Thu, 18 Mar 2004 22:30:13 GMT |
I have some fairly sizeable XML files (about 5MB - 15MB) that I want to extract the element name structure from. By that, I mean that I'll feed an arbitrary XML document in one end, and get out an XML document that contains one element for every distinct list of name() values in the ancestor-or-self axis, retains the position in document order of the first occurrence. I'd also like a count for each element output. For the moment, I'm only interested in the elements: attributes and text content are not important. For example, if I have this input document: <root> <a> <b/><c/><d/><b/> </a> <b> <a> <a/> </a> </b> <a> <a/> <b> <a/> <d/> <a/> <a/> </b> </a> </root> I would want output that looks like this: <root ct="1"> <a ct="2"> <b ct="3"> <a ct="3"/> <d ct="1"/> </b> <c ct="1"/> <d ct="1"/> <a ct="1"/> </a> <b ct="1"> <a ct="1"> <a ct="1"/> </a> </b> </root> In XSLT 2.0 this can be readily achieved by using for-each-group recursively. With Saxon 7.9 this stylesheet will generate a result for one of my 7MB files (400k elements) in 2 or 3 seconds (mostly parsing the input document) to produce the output of about 1k elements. <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> <xsl:template match="/"> <xsl:call-template name="subtree"> <xsl:with-param name="parents" select="/"/> </xsl:call-template> </xsl:template> <xsl:template name="subtree"> <xsl:param name="parents"/> <xsl:for-each-group select="$parents/*" group-by="name()"> <xsl:copy> <xsl:attribute name="ct"> <xsl:value-of select="count(current-group())"/> </xsl:attribute> <xsl:call-template name="subtree"> <xsl:with-param name="parents" select="current-group()"/> </xsl:call-template> </xsl:copy> </xsl:for-each-group> </xsl:template> </xsl:stylesheet> This is great, except that the system I want this to run in is using libxml and libxslt (consequently, XSLT 1.0). Now, transforming this into an XSLT 1.0 stylesheet can be done by just replacing the for-each-group with a for-each and checking if the context node is the first in $parents/* with that name: <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> <xsl:template match="/"> <xsl:call-template name="subtree"> <xsl:with-param name="parents" select="/"/> </xsl:call-template> </xsl:template> <xsl:template name="subtree"> <xsl:param name="parents"/> <xsl:for-each select="$parents/*"> <xsl:variable name="current-group" select="$parents/*[name() = name(current())]"/> <xsl:if test="count($current-group[1] | .) = 1"> <xsl:copy> <xsl:attribute name="ct"> <xsl:value-of select="count($current-group)"/> </xsl:attribute> <xsl:call-template name="subtree"> <xsl:with-param name="parents" select="$current-group"/> </xsl:call-template> </xsl:copy> </xsl:if> </xsl:for-each> </xsl:template> </xsl:stylesheet> But the performance of this on large files is hopeless. I tried a smaller 1.3MB, 75k elements input file that generates a 5kB, 125 element output. msxml took 2 minutes (memory usage displaying odd sawtooth pattern -- GC?), Saxon 7.9 over 5 minutes and libxslt 1.0.4 I gave up on after 10 minutes. [Aside: I wondered whether it might be better to expanded $current-group in the test and only created the variable within the if element. The rationale being that the variable might be created strictly whereas the combined xpath expression might only be evaluated up to the first element. ... <xsl:if test="count(($parents/*[name() = name(current())])[1] | .) = 1"> <xsl:variable name="current-group" select="$parents/*[name() = name(current())]"/> ... For msxsl makes an enormous difference: it's now faster than Saxon 7 running the XSLT 2.0 stylesheet -- that's for the 7MB file! How does _that_ work?! It doesn't make any great difference to Saxon (or libxslt?).] Am I right that there's no way of using Muenchian grouping for this because there's no possible XPath 1.0 expression for the list of ancestors' node names? Are there any other approaches with XSLT 1 that might get a decent performance with libxslt? One thing I thought of is to do the transform in two steps: the first generates output isomorphic to the input elements, but annotates each with its ancestor name list (as a string); the second traverses that file using Muenchian grouping. This works -- albeit with rather variable performance -- in msxml. For the 1.3MB file step 2 takes 71 seconds in Saxon and 75 in libxslt (and 1 second in msxml). This confuses me! TODO: Saxon gets java.lang.OutOfMemoryError on the second stage reading the output of stage 1 for larger files. How do I increase Java's memory limit? <!-- step 1 --> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> <xsl:template match="*"> <xsl:param name="parent-path" select="''"/> <xsl:variable name="p" select="concat($parent-path,'/',name())"/> <xsl:copy> <xsl:attribute name="p"> <xsl:value-of select="$p"/> </xsl:attribute> <xsl:apply-templates> <xsl:with-param name="parent-path" select="$p"/> </xsl:apply-templates> </xsl:copy> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet> <!-- step 2 --> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> <xsl:key name="p" match="*" use="@p"/> <xsl:template match="*"> <xsl:if test="generate-id() = generate-id(key('p', @p)[1])"> <xsl:copy> <xsl:attribute name="ct"> <xsl:value-of select="count(key('p', @p))"/> </xsl:attribute> <xsl:for-each select="key('p', @p)"> <xsl:apply-templates/> </xsl:for-each> </xsl:copy> </xsl:if> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet> Any comments, pointers, code suggestions gratefully received! Cheers, Ian. XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
[xsl] XSL-List Software Changes on , Mulberry Technologie | Thread | RE: [xsl] Efficient recursive group, Michael Kay |
[xsl] XSL-List Software Changes on , Mulberry Technologie | Date | RE: [xsl] Cannot get numeric value , Michael Kay |
Month |