[XSL-LIST Mailing List Archive Home] [By Thread] [By Date]

Re: [xsl] HST's answers Re: [xsl] Efficient way to check sequence membership -


Subject: Re: [xsl] HST's answers Re: [xsl] Efficient way to check sequence membership -
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Thu, 03 Mar 2011 00:06:43 +0000

On 02/03/2011 22:00, Henry S. Thompson wrote:
I've thought of five ways to do this:

  1) tokenise and use "some ...", as in the previous message;
  2) Add '|' at the beginning of both $stopPat and the word to be
     checked, and use contains;
  3) Put a sequence of elements with a 'w' attribute whose value is a stop
      in $stops, then do boolean($stops/*[@w=$w]);
  4) As above, but then define an appropriate key and use
      boolean($stops/key('stop',$w));
  5) Build a regexp and use match:
      concat('^(',$stopPat,')$')

Your results don't surprise me. For all except (4), the naive implementation is O(n) whereas for (4) it is O(n log n). Now we don't know what n is, but if it's large enough then O(n log n) will always win. The fact that the different O(n) solutions have different multiplying factors is always worth remembering, of course.

I'd be interested to see what the results are using the more aggressive optimiser of Saxon-EE. I haven't got your data so I can't offer any measurements, but I did look at the -explain output for various ways of writing this expression, and the only one I found that constructed an index was the perhaps non-obvious

<xsl:sequence select="exists($stops[. = $w])"/>

This suggests some opportunities for rewrites that I hadn't previously considered.

Michael Kay
Saxonica

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

Version raw time time - baseline

    0         5
    4         7          2
    2         8          3
    2a        8          3
    1a       14          9
    1        15         10
    3        28         23
    5        30         25

where 0 is the baseline where the stop function does no actual work,
and the time is average over 100 iterations, in milliseconds.

I'm really interested if anyone has a better approach.  Of course, I'm
also interested to find out if other implementations show a similar
pattern.

I've put up a gzipped tar file [1] of all the files you need to
reproduce the experiment -- one .xsl for each version, and q.xml for
input.

The stopss.xsl file is there so you can test that you are getting the
right answer!  Replace my:stop1 with your version in that file, and
check that the output is

243367200142031010020120103000130001022001513610014414440

ht

[1] http://www.ltg.ed.ac.uk/~ht/memberCheck.tar.gz


Current Thread