[xsl] faster complicated counting


Subject: [xsl] faster complicated counting
From: Syd Bauman <Syd_Bauman@xxxxxxxxx>
Date: Wed, 29 Feb 2012 10:31:28 -0500

I am working with a relatively small dataset (~ 1 MiB) which uses a
TEI encoding. In TEI, a line of verse is encoded with an <l> element
(of which I have just about 306,000), and lines are gathered into
groups (like poems or stanzas) using <lg> (for "line group").

In the output of the particular process I am working on now, I'd like
to adorn each <l> element with three new attributes that indicate the
count of the current <l> element in various contexts:
  wwp:num-global   = with respect to the entire document
  wwp:num-local    = with respect to the current stanza or other
                     small unit of poetry
  wwp:num-regional = with respect to the current poem or other
                     large unit of poetry

So, as a toy example, see tiny.in.xml and tiny.out.xml, below.

I have worked out code that gets me the desired counts. My problem is
that all the tree-walking it does slows down my process by well over
an order of magnitude. I am betting there is a much better way to do
this, probably using keys or <xsl:number>, but have not been able to
wrap my mind around it.

The English-like pseudo-code for @num-local is "the count in the
context of the closest ancestor <lg> that itself has > 4 metrical
lines".

The English-like pseudo-code for @num-regional is "the count in the
context of the closest ancestor <lg> that has a @type that contains
"poem" or whose first descendant <l> has n='1'".

Here's what I have (note that we are only counting those <l> elements
that have an @part of 'I' or do not have a @part attribute at all):

  <!-- Global: position among all counted lines in the document -->
  <xsl:attribute name="wwp:num-global">
    <xsl:number count="l[not(@part) or @part='I']" level="any"/>
  </xsl:attribute>
  <!-- Regional: position within the closest "poem" ancestor <lg> -->
  <xsl:attribute name="wwp:num-regional">
    <xsl:variable name="region"
     select="(ancestor::lg[contains(@type,'poem')]
             |ancestor::lg[descendant::l[@n eq '1']])[last()]"/>
    <xsl:value-of
     select="count(preceding::l[not(@part) or @part='I']
                   [ancestor::lg/generate-id() = $region/generate-id()]) + 1"/>
  </xsl:attribute>
  <!-- Local: position within the closest ancestor <lg> holding > 4 counted lines -->
  <xsl:attribute name="wwp:num-local">
    <xsl:variable name="region"
     select="ancestor::lg[count(descendant::l[not(@part) or @part='I']) > 4][1]"/>
    <xsl:value-of
     select="count(preceding::l[not(@part) or @part='I']
                   [ancestor::lg/generate-id() = $region/generate-id()]) + 1"/>
  </xsl:attribute>
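
One possible way to cut down the tree-walking, offered as an untested
sketch rather than a measured fix: count only inside the region
instead of filtering the whole preceding:: axis. Selecting the
region's descendant <l> elements that come before the current one
(the << operator, written &lt;&lt; inside an attribute) should pick
out the same lines as the preceding::-based test above, on the
assumption that an <l> never contains another <l>. Only num-local is
shown; num-regional would follow the same pattern with its own region
variable.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <!-- Identity transform for everything that is not an <l> -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="l">
    <!-- Closest ancestor <lg> holding more than 4 counted lines -->
    <xsl:variable name="region"
      select="ancestor::lg[count(descendant::l[not(@part) or @part='I']) > 4][1]"/>
    <xsl:copy>
      <!-- Walk only the region's own lines, keeping those that
           precede the current <l> in document order -->
      <xsl:attribute name="wwp:num-local"
        select="count($region/descendant::l[not(@part) or @part='I']
                      [. &lt;&lt; current()]) + 1"/>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

When an <l> has no qualifying ancestor, the region is empty and the
result is 1, which matches the behavior of the code above. The region
lookup itself is still evaluated once per <l>, so this narrows the
walk rather than removing it.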

Thoughts appreciated.

Notes
-----
* Yes, I realize that the test above is for *any* descendant <l> with
  n='1', not the first. We simply don't have any that aren't the
  first, so I didn't worry about it.

* It's pretty likely we'll change the definition of what is
  "regional" in the near future, but it probably won't affect the
  basic problem I'm having. I.e., I'm hoping that if someone shows me
  how to do this "regional" better, I'll be able to do any future
  version on my own. Cross your fingers :-)
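
If the difference between "any descendant" and "first descendant" in
the first note ever starts to matter, the region test could be
tightened as sketched below. The function name wwp:region is
hypothetical (it is not part of the existing stylesheet), and XSLT 2.0
is assumed, as in the toy code.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <!-- Hypothetical helper: returns the closest ancestor <lg> of $line
       whose @type contains "poem" or whose FIRST descendant <l> has
       n='1'; returns the empty sequence if there is none -->
  <xsl:function name="wwp:region" as="element()?">
    <xsl:param name="line" as="element()"/>
    <xsl:sequence
      select="($line/ancestor::lg[contains(@type,'poem')]
              |$line/ancestor::lg[descendant::l[1]/@n eq '1'])[last()]"/>
  </xsl:function>

</xsl:stylesheet>
```

Inside the <l> template this would be called as wwp:region(.). The
step descendant::l[1] picks the first <l> in document order, so an
<lg> whose first line lacks @n, or has some other value, no longer
qualifies on that branch.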


toy input
--- -----
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0">
  <teiHeader>
    <!-- blah, blah, blah -->
  </teiHeader>
  <text>
    <body>
      <lg type="superStructure">
        <lg type="poem.duck">
          <l>one</l>
          <l>two</l>
          <l>three</l>
          <l>four</l>
          <l>five</l>
          <l>six</l>
          <l>seven</l>
          <l>eight</l>
          <l>nine</l>
          <l>ten</l>
        </lg>
        <lg type="poem.duck">
          <l>one</l>
          <l>two</l>
          <l>three</l>
          <l>four</l>
          <lg type="tercet">
            <l>five</l>
            <l>six</l>
            <l>seven</l>
          </lg>
          <l>eight</l>
          <l>nine</l>
          <l>ten</l>
        </lg>
        <lg type="poem.duck">
          <lg type="stanza">
            <l>one</l>
            <l>two</l>
            <l>three</l>
            <l>four</l>
            <l>five</l>
            <l>six</l>
            <l>seven</l>
            <l>eight</l>
          </lg>
          <lg type="stanza">
            <l>nine</l>
            <l>ten</l>
            <l>eleven</l>
            <l>twelve</l>
            <l>thirteen</l>
            <l>fourteen</l>
            <l>fifteen</l>
            <l>sixteen</l>
          </lg>
          <lg type="stanza">
            <l>seventeen</l>
            <l>eighteen</l>
            <l>nineteen</l>
            <l>twenty</l>
            <l>twentyone</l>
            <l>twentytwo</l>
            <l>twentythree</l>
            <l>twentyfour</l>
          </lg>
        </lg>
      </lg>
    </body>
  </text>
</TEI>

toy code
--- ----
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0" xmlns="http://www.tei-c.org/ns/1.0"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">

  <xsl:template match="/">
    <xsl:text>&#x0A;</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="@*|text()|processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="l">
    <xsl:copy>
      <xsl:attribute name="wwp:num-global">
        <xsl:number count="l[not(@part)]|l[@part='I']" level="any"/>
      </xsl:attribute>
      <xsl:attribute name="wwp:num-regional">
        <xsl:variable name="region"
          select="(ancestor::lg[ contains( @type,'poem') ]|ancestor::lg[ descendant::l[ @n eq '1'] ])[last()]"/>
        <xsl:value-of
          select="count( (preceding::l[not(@part)]|preceding::l[@part='I'])[ancestor::lg/generate-id() = $region/generate-id() ] ) +1"
        />
      </xsl:attribute>
      <xsl:attribute name="wwp:num-local">
        <xsl:variable name="region"
          select="ancestor::lg[count( descendant::l[not(@part) or @part='I'] ) > 4 ][1]"/>
        <xsl:value-of
          select="count( (preceding::l[not(@part)]|preceding::l[@part='I'])[ancestor::lg/generate-id() = $region/generate-id() ] ) +1"
        />
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
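
Since <xsl:number> is mentioned above as a possible route, here is a
sketch of the regional count using its from attribute. It has not
been timed, and whether it is faster depends on the processor. It
also agrees with the definition of "regional" only when regions do
not nest: from restarts at the nearest matching <lg> that is an
ancestor of, or precedes, the current line, so a line that follows a
nested region (rather than sitting inside it) would restart at that
nested region.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0">

  <!-- Identity transform for everything that is not an <l> -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="l">
    <xsl:copy>
      <xsl:attribute name="wwp:num-regional">
        <!-- Count the counted lines up to and including this one,
             starting over at the nearest "poem" <lg> -->
        <xsl:number level="any"
          count="l[not(@part) or @part='I']"
          from="lg[contains(@type,'poem') or descendant::l[@n eq '1']]"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

As with the num-global attribute, an <l> that is not itself counted
(a @part other than 'I') gets the number of the last counted line
before it, not that number plus one.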

toy output
--- ------
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:wwp="http://www.wwp.brown.edu/ns/textbase/storage/1.0">
  <teiHeader>
    <!-- blah, blah, blah -->
  </teiHeader>
  <text>
    <body>
      <lg type="superStructure">
        <lg type="poem.duck">
          <l wwp:num-global="1" wwp:num-regional="1" wwp:num-local="1">one</l>
          <l wwp:num-global="2" wwp:num-regional="2" wwp:num-local="2">two</l>
          <l wwp:num-global="3" wwp:num-regional="3" wwp:num-local="3">three</l>
          <l wwp:num-global="4" wwp:num-regional="4" wwp:num-local="4">four</l>
          <l wwp:num-global="5" wwp:num-regional="5" wwp:num-local="5">five</l>
          <l wwp:num-global="6" wwp:num-regional="6" wwp:num-local="6">six</l>
          <l wwp:num-global="7" wwp:num-regional="7" wwp:num-local="7">seven</l>
          <l wwp:num-global="8" wwp:num-regional="8" wwp:num-local="8">eight</l>
          <l wwp:num-global="9" wwp:num-regional="9" wwp:num-local="9">nine</l>
          <l wwp:num-global="10" wwp:num-regional="10" wwp:num-local="10">ten</l>
        </lg>
        <lg type="poem.duck">
          <l wwp:num-global="11" wwp:num-regional="1" wwp:num-local="1">one</l>
          <l wwp:num-global="12" wwp:num-regional="2" wwp:num-local="2">two</l>
          <l wwp:num-global="13" wwp:num-regional="3" wwp:num-local="3">three</l>
          <l wwp:num-global="14" wwp:num-regional="4" wwp:num-local="4">four</l>
          <lg type="tercet">
            <l wwp:num-global="15" wwp:num-regional="5" wwp:num-local="5">five</l>
            <l wwp:num-global="16" wwp:num-regional="6" wwp:num-local="6">six</l>
            <l wwp:num-global="17" wwp:num-regional="7" wwp:num-local="7">seven</l>
          </lg>
          <l wwp:num-global="18" wwp:num-regional="8" wwp:num-local="8">eight</l>
          <l wwp:num-global="19" wwp:num-regional="9" wwp:num-local="9">nine</l>
          <l wwp:num-global="20" wwp:num-regional="10" wwp:num-local="10">ten</l>
        </lg>
        <lg type="poem.duck">
          <lg type="stanza">
            <l wwp:num-global="21" wwp:num-regional="1" wwp:num-local="1">one</l>
            <l wwp:num-global="22" wwp:num-regional="2" wwp:num-local="2">two</l>
            <l wwp:num-global="23" wwp:num-regional="3" wwp:num-local="3">three</l>
            <l wwp:num-global="24" wwp:num-regional="4" wwp:num-local="4">four</l>
            <l wwp:num-global="25" wwp:num-regional="5" wwp:num-local="5">five</l>
            <l wwp:num-global="26" wwp:num-regional="6" wwp:num-local="6">six</l>
            <l wwp:num-global="27" wwp:num-regional="7" wwp:num-local="7">seven</l>
            <l wwp:num-global="28" wwp:num-regional="8" wwp:num-local="8">eight</l>
          </lg>
          <lg type="stanza">
            <l wwp:num-global="29" wwp:num-regional="9" wwp:num-local="1">nine</l>
            <l wwp:num-global="30" wwp:num-regional="10" wwp:num-local="2">ten</l>
            <l wwp:num-global="31" wwp:num-regional="11" wwp:num-local="3">eleven</l>
            <l wwp:num-global="32" wwp:num-regional="12" wwp:num-local="4">twelve</l>
            <l wwp:num-global="33" wwp:num-regional="13" wwp:num-local="5">thirteen</l>
            <l wwp:num-global="34" wwp:num-regional="14" wwp:num-local="6">fourteen</l>
            <l wwp:num-global="35" wwp:num-regional="15" wwp:num-local="7">fifteen</l>
            <l wwp:num-global="36" wwp:num-regional="16" wwp:num-local="8">sixteen</l>
          </lg>
          <lg type="stanza">
            <l wwp:num-global="37" wwp:num-regional="17" wwp:num-local="1">seventeen</l>
            <l wwp:num-global="38" wwp:num-regional="18" wwp:num-local="2">eighteen</l>
            <l wwp:num-global="39" wwp:num-regional="19" wwp:num-local="3">nineteen</l>
            <l wwp:num-global="40" wwp:num-regional="20" wwp:num-local="4">twenty</l>
            <l wwp:num-global="41" wwp:num-regional="21" wwp:num-local="5">twentyone</l>
            <l wwp:num-global="42" wwp:num-regional="22" wwp:num-local="6">twentytwo</l>
            <l wwp:num-global="43" wwp:num-regional="23" wwp:num-local="7">twentythree</l>
            <l wwp:num-global="44" wwp:num-regional="24" wwp:num-local="8">twentyfour</l>
          </lg>
        </lg>
      </lg>
    </body>
  </text>
</TEI>

