
[xsl] efficient traversal of combined collections in XSLT 3.0


Subject: [xsl] efficient traversal of combined collections in XSLT 3.0
From: Graydon <graydon@xxxxxxxxx>
Date: Sat, 24 Nov 2012 08:53:06 -0500

So I have about 4.0 GB of "production" content: XML that's already in use, that can have deliverables generated from it, and that various groups of editors may change.

I have "content", some content (generally about .2 or .25 GB) that is being converted from SGML and which, before it is added to "production", needs to be checked to see if the links in it work.

Links use a combination of @area (the name of a scope within which numbers are unique) and @cite (the number); this is for legislation, so the numbers can get complicated, but the basic scheme is pretty simple.  (Targets are one direction in a bi-directional relationship, so a link in a fancy hat; they usually contain links, and we only need to check them if they _don't_ contain a link.)
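
To make the scheme concrete, here is a made-up fragment, for illustration only. The nesting, the act, section and p wrapper elements, and every value are invented; only the names link, target, reference-text, num, @area and @cite are the real ones.

```xml
<!-- Hypothetical sample; not taken from the real content. -->
<act area="statutes">
  <section>
    <!-- An element with a num[@cite] child is something a link can point at. -->
    <num cite="12.3">12.3</num>
    <!-- A link names the scope (@area) and the number (@cite) it refers to. -->
    <p>See <link area="regulations" cite="45(2)(a)">regulation 45(2)(a)</link>.</p>
    <!-- A target with no link inside its reference-text has to be checked itself. -->
    <target area="statutes" cite="12.3">
      <reference-text>section 12.3</reference-text>
    </target>
  </section>
</act>
```

With the key defined in the stylesheet below, the section in this sketch would be indexed under the value statutes|12.3, which is also what the target's @area and @cite concatenate to.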

The slightly tricky bit is that I want to check the links in "content" to see if they match something in "content" _and_ in "production"; XSLT 3.0's version of key() will accept an arbitrary top-node, so (using the Saxon 9.4 that ships with the current oXygen, 14.1) I can declare the stylesheet to be version 3.0, combine "production" and "content" into "searchSpace", and define a key on that.

<xsl:stylesheet exclude-result-prefixes="xs" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:variable name="content" select="collection('file:///home/graydon/stages/APFF?recurse=yes;select=*.xml')"/>
  <xsl:variable name="production"
    select="collection('file:///home/graydon/stages/production/2012-11-13?recurse=yes;select=*.xml;on-error=ignore')"/>
  <xsl:variable name="searchSpace" select="($content,$production)"/>
  <xsl:key match="*[num[@cite]]" name="places" use="concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)"/>
  <xsl:template match="/">
    <bucket>
      <xsl:for-each select="$content//link,$content//target[not(reference-text/link)]">
        <xsl:choose>
          <xsl:when test="key('places',concat(current()/@area,'|',current()/@cite),$searchSpace)">
            <good>
              <uri>
                <xsl:sequence select="base-uri(.)"/>
              </uri>
              <xsl:sequence select="."/>
            </good>
          </xsl:when>
          <xsl:otherwise>
            <bad>
              <uri>
                <xsl:sequence select="base-uri(.)"/>
              </uri>
              <xsl:sequence select="."/>
            </bad>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </bucket>
  </xsl:template>
</xsl:stylesheet>
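
One caveat on the key() call: the spec declares the third argument as a single node, so a processor that is strict about it may reject a sequence of document nodes like $searchSpace. A minimal sketch of the same lookup done one document at a time follows, assuming nothing beyond the key above; the f: namespace and the f:resolves function are hypothetical names made up for this example.

```xml
<xsl:stylesheet version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:f="urn:example:link-check"
  exclude-result-prefixes="xs f">

  <!-- Same key as in the stylesheet above. -->
  <xsl:key name="places" match="*[num[@cite]]"
    use="concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)"/>

  <!-- Hypothetical helper: true if at least one of the supplied
       documents has a 'places' entry for the given key value.
       key() is called once per document, so its third argument
       is always a single node. -->
  <xsl:function name="f:resolves" as="xs:boolean">
    <xsl:param name="value" as="xs:string"/>
    <xsl:param name="docs" as="node()*"/>
    <xsl:sequence
      select="some $d in $docs satisfies exists(key('places', $value, $d))"/>
  </xsl:function>

</xsl:stylesheet>
```

With that in place, the test in the xsl:when would become f:resolves(concat(@area,'|',@cite), $searchSpace). This only changes how the key is called; it still needs every document in memory, so it does nothing for the RAM problem.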

This works well on content-sized chunks of input (.25 GB or so) and I get an answer in about 15 seconds.

It doesn't work on the full data set; 16 GB of RAM isn't enough to do this to 4 GB of data.  Various wheels are in motion to get more RAM.

So maybe everything will be fine, but I can't help looking at that code and going "this is a really naive search; there has to be a more efficient way to do this."

So, O XSLT List, what's the more efficient way to do this?

Thanks!

-- Graydon

