
[xsl] Extracting the grouping from a flat structure


Subject: [xsl] Extracting the grouping from a flat structure
From: Peter Wyngaard <peter@xxxxxxxxxxxx>
Date: Sun, 5 Dec 2004 19:34:09 -0500

Hello!

I have a nested group structure in HTML that I would like to translate to XML using XSL. In text form, here is what the HTML table looks like:

Header1
  Row 1
  Row 2
  ...
  Row N

Header2
  Row 1
  Row 2
  ...
  Row M

...

I would like to create XML output that looks like:

<header attr=...>
  <row>...</row>
  <row>...</row>
  ...
  <row>...</row>
</header>
<header attr=...>
  <row>...</row>
  <row>...</row>
  ...
  <row>...</row>
</header>
...

This should be very straightforward, but the problem is that the nested group structure is not reflected in the HTML. Rather than using nested HTML tables, the whole thing is expressed as one big table with a lot of rows:

<table class="results">
 <tr>
  <th>Header1</th>
 </tr>
 <tr>
  <td>row1</td>
 </tr>
 <tr>
  <td>row2</td>
 </tr>
 ...
 <tr>
  <td>rowN</td>
 </tr>
 <tr>
  <td>&nbsp;</td> <!-- blank line between groups -->
 </tr>
 <tr>
  <th>Header2</th>
 </tr>
 <tr>
  <td>row1</td>
 </tr>
 <tr>
  <td>row2</td>
 </tr>
 ...
 <tr>
  <td>rowM</td>
 </tr>
 ...
</table>

The only good news is that the "header" rows are written using <th> tags instead of <td> tags, so I can differentiate "headers" from "rows". Inspired by some of the posts on this most excellent mailing list, I came up with the following XSL to accomplish the task:

<xsl:for-each select='//TABLE[@class="results"]/TR[TH]'>
<header>
<xsl:attribute name=...>...</xsl:attribute>
<xsl:variable name='thisHeader' select='generate-id(.)'/>
<xsl:for-each select='following-sibling::TR[$thisHeader=generate-id(preceding-sibling::TR[TH][1])]'>
<row>
...
</row>
</xsl:for-each>
</header>
</xsl:for-each>


This works great, but it's pretty darn inefficient. I'm dealing with tables that have hundreds of rows and around a dozen "header" sections, so my nested for-each loops cause hundreds of TR nodes to be evaluated about a dozen times each. I'm processing thousands of HTML files of 6 different types, and each type has its own XSL file for extracting data. None of the other HTML file types has this weird structural problem, and they all process very quickly. When one of these weird files is encountered, it takes 5-6 times longer to process.

It seems like people exploit "keys" as much as possible when trying to minimize processing time, but I haven't been able to wrap my head around a "key" solution for this problem yet.

Can someone think of a more efficient way of dealing with this case?

Thanks!

Peter
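
For reference, here is a minimal sketch of what a key-based version of the grouping might look like in XSLT 1.0. It is not a tested solution for the real files. The key name rows-by-header, the <results> wrapper element, and the use of the TD text as the row content are all assumptions made for illustration, and the header attribute is left out because it is elided above. The sketch also assumes that the blank separator rows contain only a non-breaking space and should be skipped.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index every non-blank data row by the generated id of the nearest
       preceding header row. "rows-by-header" is a hypothetical name.
       The second predicate drops rows whose first TD holds only
       whitespace or a non-breaking space (assumed separator rows). -->
  <xsl:key name="rows-by-header"
           match="TR[TD][normalize-space(translate(TD, '&#160;', ' '))]"
           use="generate-id(preceding-sibling::TR[TH][1])"/>

  <xsl:template match="/">
    <!-- <results> is a hypothetical wrapper so the output has one root -->
    <results>
      <xsl:for-each select="//TABLE[@class='results']/TR[TH]">
        <header>
          <!-- Look up this header's rows in the key instead of
               re-scanning all following siblings for every header -->
          <xsl:for-each select="key('rows-by-header', generate-id(.))">
            <row>
              <!-- placeholder: the real row content is elided above -->
              <xsl:value-of select="TD"/>
            </row>
          </xsl:for-each>
        </header>
      </xsl:for-each>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

With a key like this, the processor can build the row index once per document, so each header does a lookup rather than testing every following TR. How much this helps in practice depends on how the processor implements keys.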

