
[xsl] Extraction of data using key() and matches()


Subject: [xsl] Extraction of data using key() and matches()
From: Jakob Fix <jakob.fix@xxxxxxxxx>
Date: Sat, 5 Jun 2010 21:02:20 +0200

Hello,

I have a large number of XML data files (previously Excel files), each
of which contains a table made up of rows and data cells.

I want to find out whether a given country name occurs in any of the
table's data cells and, if it does, record in another file all the
country names that appear in the data file. The country name may be
the only content of the data cell (<col>United Kingdom</col>), or it
may be surrounded by other text (<col>Data has been provided for
United Kingdom only.</col>). A cell may also contain more than one
country name. Cells contain no child elements, just character data.

My current approach is to have an exhaustive lookup file with *all*
country names that are potentially used. For each XML data file, I
loop over all country names and test whether any cell in the data
file matches the current country name.

The following works but is rather slow:

countries.xml

<countries>
  <country code="ABW">
    <fr>Aruba</fr>
    <en>Aruba</en>
  </country>
  <country code="AFG">
    <fr>Afghanistan</fr>
    <en>Afghanistan</en>
  </country>
  ...
</countries>

data.xml

<workbook>
  <sheet>
    <name><![CDATA[Figure 1.1 (I)]]></name>
    <row number="0">
      <col number="0"><![CDATA[United Kingdom]]></col>
    </row>
    <row number="1">
      <col number="0"><![CDATA[Part I. ]]></col>
      <col number="1"><![CDATA[These data apply to France, Germany and
a couple of other countries.]]></col>
     ...
    </row>
   ...
  </sheet>
</workbook>

extract.xsl

<xsl:for-each select="document($country-file)/countries/country/en">
  <xsl:variable name="current-node" select="."/>
  <xsl:if test="$data-doc//col[matches(., $current-node/text())]">
    <country><xsl:value-of select="$current-node/../@code"/></country>
  </xsl:if>
</xsl:for-each>


In order to speed up the process I was thinking about indexing all
data cells using xsl:key. However, I cannot see how the key() and the
matches() function can be combined to use the former's speed with the
latter's regex power.

I was hoping to do something along these lines, but I would need some
help, as this doesn't currently work:

<!-- create an index of the cells' contents -->
<xsl:key name="cell" match="col" use="text()"/>

<xsl:for-each select="document($country-file)/countries/country/en">
  <!-- don't lose the current node -->
  <xsl:variable name="current-node" select="."/>
  <!-- change context to the data document -->
  <xsl:for-each select="document($data-file)">
    <!-- key() returns a node-set, so count its nodes; this only finds
         cells whose entire content is the country name -->
    <xsl:if test="count(key('cell', $current-node)) > 0">
      <country><xsl:value-of select="$current-node/../@code"/></country>
    </xsl:if>
  </xsl:for-each>
</xsl:for-each>
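
One possible direction, given only as an untested sketch that assumes an
XSLT 2.0 processor: rather than scanning every cell once per country,
build a single regular expression from all the country names and scan
each cell once with xsl:analyze-string, then use key() only for the exact
lookup from a matched name back to its code. The names country-doc,
country-by-name, escaped-names, regex, hits and the wrapper element
countries-found are hypothetical, and the data file is assumed to be the
principal input document. Only the <en> names are used here.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- assumed default; pass the real lookup file as a parameter -->
  <xsl:param name="country-file" select="'countries.xml'"/>
  <xsl:variable name="country-doc" select="document($country-file)"/>

  <!-- exact-match index: English country name to country element -->
  <xsl:key name="country-by-name" match="country" use="en"/>

  <!-- escape regex metacharacters in each name; longest names first,
       because the first alternative that matches at a position wins -->
  <xsl:variable name="escaped-names" as="xs:string*">
    <xsl:for-each select="$country-doc/countries/country/en">
      <xsl:sort select="string-length(.)" data-type="number"
                order="descending"/>
      <xsl:sequence
          select="replace(., '[.\[\]\\|\-\^\$\?\*\+\{\}\(\)]', '\\$0')"/>
    </xsl:for-each>
  </xsl:variable>

  <!-- one alternation of all names: name1|name2|... -->
  <xsl:variable name="regex" select="string-join($escaped-names, '|')"/>

  <xsl:template match="/">
    <!-- collect every country name found in any cell, one pass per cell -->
    <xsl:variable name="hits" as="xs:string*">
      <xsl:for-each select="//col">
        <xsl:analyze-string select="." regex="{$regex}">
          <xsl:matching-substring>
            <xsl:sequence select="."/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </xsl:variable>

    <countries-found>
      <!-- map each distinct matched name back to its code via the key -->
      <xsl:for-each select="distinct-values($hits)">
        <country>
          <xsl:value-of
              select="key('country-by-name', ., $country-doc)/@code"/>
        </country>
      </xsl:for-each>
    </countries-found>
  </xsl:template>

</xsl:stylesheet>
```

Some caveats on this sketch: XPath 2.0 regular expressions have no
word-boundary test, so a short name can still match inside a longer word;
the lookup file must contain at least one name, since xsl:analyze-string
rejects a regex that matches the empty string; and whether this is
actually faster than the loop above would have to be measured.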

Maybe there's another solution that is more elegant and more efficient
than what I've shown above. I'd love to know about it.  Thank you in
advance for your help.

Jakob.

