[XSL-LIST Mailing List Archive Home] [By Thread] [By Date]

[xsl] Efficient recursive grouping in XSLT 1.0?


Subject: [xsl] Efficient recursive grouping in XSLT 1.0?
From: Ian Young <I.M.Young@xxxxxxxxxx>
Date: Thu, 18 Mar 2004 22:30:13 GMT

I have some fairly sizeable XML files (about 5MB - 15MB) that I want to
extract the element name structure from. By that, I mean that I'll feed an
arbitrary XML document in one end, and get out an XML document that
contains one element for every distinct list of name() values in the
ancestor-or-self axis, retains the position in document order of the first
occurrence. I'd also like a count for each element output.
For the moment, I'm only interested in the elements: attributes and text
content are not important.

For example, if I have this input document:
<root>
  <a>
    <b/><c/><d/><b/>
  </a>
  <b>
    <a>
      <a/>
    </a>
  </b>
  <a>
    <a/>
    <b>
      <a/>
      <d/>
      <a/>
      <a/>
    </b>
  </a>
</root>

I would want output that looks like this:
<root ct="1">
  <a ct="2">
    <b ct="3">
      <a ct="3"/>
      <d ct="1"/>
    </b>
    <c ct="1"/>
    <d ct="1"/>
    <a ct="1"/>
  </a>
  <b ct="1">
    <a ct="1">
      <a ct="1"/>
    </a>
  </b>
</root>

In XSLT 2.0 this can be readily achieved by using for-each-group
recursively. With Saxon 7.9 this stylesheet will generate a result for one
of my 7MB files (400k elements) in 2 or 3 seconds (mostly parsing the input
document) to produce the output of about 1k elements.

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <xsl:call-template name="subtree">
      <xsl:with-param name="parents" select="/"/>
    </xsl:call-template>
  </xsl:template>
  
  <xsl:template name="subtree">
    <xsl:param name="parents"/>
    <xsl:for-each-group select="$parents/*" group-by="name()">
      <xsl:copy>
        <xsl:attribute name="ct">
          <xsl:value-of select="count(current-group())"/>
        </xsl:attribute>
        <xsl:call-template name="subtree">
          <xsl:with-param name="parents" select="current-group()"/>
        </xsl:call-template>
      </xsl:copy>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>

This is great, except that the system I want this to run in is using libxml
and libxslt (consequently, XSLT 1.0).
Now, transforming this into an XSLT 1.0 stylesheet can be done by just
replacing the for-each-group with a for-each and checking if the context
node is the first in $parents/* with that name:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <xsl:call-template name="subtree">
      <xsl:with-param name="parents" select="/"/>
    </xsl:call-template>
  </xsl:template>
  
  <xsl:template name="subtree">
    <xsl:param name="parents"/>
    <xsl:for-each select="$parents/*">
      <xsl:variable name="current-group"
          select="$parents/*[name() = name(current())]"/>
      <xsl:if test="count($current-group[1] | .) = 1">
        <xsl:copy>
          <xsl:attribute name="ct">
            <xsl:value-of select="count($current-group)"/>
          </xsl:attribute>
          <xsl:call-template name="subtree">
            <xsl:with-param name="parents" select="$current-group"/>
          </xsl:call-template>
        </xsl:copy>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

But the performance of this on large files is hopeless. I tried a smaller
1.3MB, 75k elements input file that generates a 5kB, 125 element
output. msxml took 2 minutes (memory usage displaying odd sawtooth pattern
-- GC?), Saxon 7.9 over 5 minutes and libxslt 1.0.4 I gave up on after 10
minutes.

[Aside: I wondered whether it might be better to expanded $current-group in
the test and only created the variable within the if element. The rationale
being that the variable might be created strictly whereas the combined
xpath expression might only be evaluated up to the first element.

  ...
  <xsl:if test="count(($parents/*[name() = name(current())])[1] | .) = 1">
    <xsl:variable name="current-group"
        select="$parents/*[name() = name(current())]"/>
    ...

For msxsl makes an enormous difference: it's now faster than Saxon 7
running the XSLT 2.0 stylesheet -- that's for the 7MB file!  How does
_that_ work?! It doesn't make any great difference to Saxon (or libxslt?).]

Am I right that there's no way of using Muenchian grouping for this because
there's no possible XPath 1.0 expression for the list of ancestors' node
names?

Are there any other approaches with XSLT 1 that might get a decent
performance with libxslt?

One thing I thought of is to do the transform in two steps: the first
generates output isomorphic to the input elements, but annotates each with
its ancestor name list (as a string); the second traverses that file using
Muenchian grouping. This works -- albeit with rather variable performance
-- in msxml. For the 1.3MB file step 2 takes 71 seconds in Saxon and 75 in
libxslt (and 1 second in msxml). This confuses me!

TODO: Saxon gets java.lang.OutOfMemoryError on the second stage reading the
output of stage 1 for larger files. How do I increase Java's memory limit?

<!-- step 1 -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <xsl:template match="*">
    <xsl:param name="parent-path" select="''"/>
    <xsl:variable name="p" select="concat($parent-path,'/',name())"/>
    <xsl:copy>
      <xsl:attribute name="p">
        <xsl:value-of select="$p"/>
      </xsl:attribute>
      <xsl:apply-templates>
        <xsl:with-param name="parent-path" select="$p"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>

<!-- step 2 -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <xsl:key name="p" match="*" use="@p"/>

  <xsl:template match="*">
    <xsl:if test="generate-id() = generate-id(key('p', @p)[1])">
      <xsl:copy>
        <xsl:attribute name="ct">
          <xsl:value-of select="count(key('p', @p))"/>
        </xsl:attribute>
        <xsl:for-each select="key('p', @p)">
          <xsl:apply-templates/>
        </xsl:for-each>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>

Any comments, pointers, code suggestions gratefully received!

Cheers,

Ian.

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list



Current Thread
Keywords