[XSL-LIST Mailing List Archive Home] [By Thread] [By Date]

Re: [xsl] [Summary] Three ways to express in XPath that there are no duplicates in a list of items

Subject: Re: [xsl] [Summary] Three ways to express in XPath that there are no duplicates in a list of items
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Fri, 2 Nov 2012 06:01:06 -0700

> count(Websites/*) = count(distinct-values(Websites/*))

A more efficient version (for an XPath processor with weak optimizer)
of this is:


So, we only count up to:

and don't need to count all children of Websites and then compare them
to the count of distinct values.

This is the same as Dirichlet's principle (aka "pigeonhole principle"):

If n items are put into m pigeonholes with n > m, then at least one
pigeonhole must contain more than one item.


On Fri, Nov 2, 2012 at 2:33 AM, Liam R E Quin <liam@xxxxxx> wrote:
> On Fri, 2012-11-02 at 09:15 +0000, Costello, Roger L. wrote:
> [...]
>> Here's how to implement it in XPath 1.0 and in XPath 2.0.
>> XPath 1.0:
>>       not(Websites/*[. = preceding-sibling::*])
>> XPath 2.0:
>>       empty(Websites/*[index-of(../*,.)[2]])
>>       count(Websites/*) = count(distinct-values(Websites/*))
>> The preferred XPath is the last one because it has the best
>> performance. The first two take on the order of n-squared time (where
>> n is the number of websites in the list) whereas the last XPath
>> expression takes on the order of n log n time.
> Some brief comments on this...
> (1) the XPath 1 solution also works in later versions of XPath;
> (2) this only works to test simple text content
> (3) an XPath 2 (or later) processor can rewrite the first or second
> expression if it likes (and is smart enough) into the third
> (4) the last expression can be implemented in linear time, not n log n,
> because it's not not necessary to sort values to see if they are
> distinct (e.g. the implementation can use a hash table), although this
> does use more memory. But for simple content the hash table approach
> doesn't even need to use more memory, can just point into the document
> tree.
> So statements about performance need to be with respect to a particular
> version of a specific product, in a specific environment.
> And,
> (5) I'm sure these aren't the only ways to see if all items are
> distinct :-)
> You can start getting pretty fancy,
> some $w in Websites/Website satisfies count(Websites/Website[. eq $w])
> gt 1,
> or flwor expressions in XQuery,
> or use a key or (XPath in XSLT 3) a map, or.... :)
> Liam
> --
> Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
> Pictures from old books: http://fromoldbooks.org/
> Ankh: irc.sorcery.net irc.gnome.org freenode/#xml

Dimitre Novatchev
Truly great madness cannot be achieved without significant intelligence.
To invent, you need a good imagination and a pile of junk
Never fight an inanimate object
To avoid situations in which you might make mistakes may be the
biggest mistake of all
Quality means doing it right when no one is looking.
You've achieved success in your field when you don't know whether what
you're doing is work or play
Facts do not cease to exist because they are ignored.
Typing monkeys will write all Shakespeare's works in 200yrs.Will they
write all patents, too? :)
I finally figured out the only reason to be alive is to enjoy it.

Current Thread