[XSL-LIST Mailing List Archive Home] [By Thread] [By Date]

Re: [xsl] finding a minimum set of references


Subject: Re: [xsl] finding a minimum set of references
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Wed, 9 Jan 2013 18:39:40 -0800

Graydon,

I am glad this was useful.

Forgive me for not mentioning keys explicitly -- I can't imagine that
someone wouldn't use keys for this :)

Cheers,
Dimitre

On Wed, Jan 9, 2013 at 5:55 PM, Graydon <graydon@xxxxxxxxx> wrote:
> On Wed, Jan 09, 2013 at 04:56:44AM -0800, Dimitre Novatchev scripsit:
>> This can be done in three steps:
>>
>>  1. Get the distinct values of concat(@area,'+',@site) for all tree elements.
>>
>>  2. Get a sorted sequence of the corresponding `loc` elements that
>> were produced in step 1. The sort is by the count of `tree` elements
>> that have that particular loc element as descendant.
>>
>>  3. Recursively add to the "minList" the top of the sorted loc
>> elements that isn't yet in this "minList". Stop condition -- when
>> there is nothing more to add.
>
> This is very similar to what I did, after having woken up at the 3 AM
> local going "wait! I can do this!" and making rapid notes.  (Having
> learnt what happens should I _not_ make those notes!)
>
> (Much thanks to Wolfgang for saying "use keys!" last night.)
>
> What I did was to go through each tree element's loc element children, and
> count how many other such loc elements existed anywhere in the data set. (Keys
> were very useful for this, and just let me note that trying to stick the result
> of a key lookup that returns a node in a comment via xsl:sequence during
> debugging is something I should remember not to do more reliably.)
>
> Once I had that count associated with each loc element (@other), I deleted all
> the loc elements but the one with the highest value for @other.
>
> Third step was to make a unique list of the @area/@cite pairs contained in
> those surviving loc elements, and consider that to represent the minimum
> spanning set of documents that contain all the element parent/child pattern
> examples.
>
> (There are other steps about "the example of this parent/child pattern is in
> this document" and fetching those documents and so on, but those aren't in the
> code sample below, which is already plenty long enough.)
>
>> I believe that this would really produce the minimum sequence of
>> locations -- due to the fact that we pick locations in decreasing
>> order of trees that reference them.
>
> I think what I've got, which I suspect is equivalent in a graph-theoretic
> sense, also does this.  I am also really really glad my current clients aren't
> going to ask me to _prove_ that.
>
> The input set this was tested on had 19,503 documents in it; the minimum
> spanning set of examples that results has 294 documents in it, so about 1.5% of
> the total.  If that's not actually minimal, it's close enough.  Total,
> end-to-end run time (via Saxon 9.4.something in oXygen), from "make a
> collection out of this half-GB pile of documents" to "minimum spanning set of
> documents in a pile" is about 65 seconds. (The bit below, run independently on
> the big set of tree elements, takes about a second.)
>
> I'm really pleased with this result.  (Some of the previous attempts were
> taking more than half an hour of run time to not work....)
>
>> I would be willing to produce the code that implements this algorithm,
>> but I would need to be provided with a complete (but not too long)
>> source XML document -- so that the result could be easily verified
>
> That's extraordinarily kind of you, but I really did just need the algorithm
> hint.  (This time, at least! :)
>
> I'm greatly reassured at the similarity of the suggested methodology, too!
>
> Thanks, Dimitre!
>
> -- Graydon
>
>
> <xsl:stylesheet exclude-result-prefixes="xs xd" version="2.0" xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl"
>   xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>   <xd:doc scope="stylesheet">
>     <xd:desc>
>       <xd:p><xd:b>Created on:</xd:b> Jan 8, 2013</xd:p>
>       <xd:p><xd:b>Author:</xd:b> graydon</xd:p>
>       <xd:p/>
>     </xd:desc>
>   </xd:doc>
>   <xsl:key match="//loc" name="locKey" use="concat(@area,'|',@cite)"/>
>   <xsl:template match="/">
>     <xsl:variable name="getLocCounts">
>       <xsl:apply-templates mode="countLoc" select="node()"/>
>     </xsl:variable>
>     <xsl:variable name="findMostOther">
>       <xsl:apply-templates mode="mostOther" select="$getLocCounts"/>
>     </xsl:variable>
>     <xsl:variable name="makeDocumentList">
>       <xsl:apply-templates mode="docList" select="$findMostOther"/>
>     </xsl:variable>
>     <xsl:sequence select="$makeDocumentList"/>
>   </xsl:template>
>   <xsl:template match="node()|@*" mode="countLoc mostOther docList prettify">
>     <xsl:copy>
>       <xsl:apply-templates mode="#current" select="node()|@*"/>
>     </xsl:copy>
>   </xsl:template>
>   <xsl:template match="loc" mode="countLoc">
>     <xsl:copy>
>       <xsl:attribute name="other" select="count(key('locKey',concat(@area,'|',@cite))[not(. is current())])"/>
>       <xsl:apply-templates mode="countLoc" select="node()|@*"/>
>     </xsl:copy>
>   </xsl:template>
>   <xsl:template match="locations" mode="mostOther">
>     <xsl:copy>
>       <xsl:for-each-group group-by="@other" select="loc">
>         <xsl:sort data-type="number" order="descending" select="@other"/>
>         <xsl:if test="position() eq 1">
>           <xsl:sequence select="current-group()[1]"/>
>         </xsl:if>
>       </xsl:for-each-group>
>     </xsl:copy>
>   </xsl:template>
>   <xsl:template match="container" mode="docList">
>     <wkna-shared-cms>
>       <assembly area="testpub" xml:lang="en">
>         <xsl:for-each-group group-by="concat(@area,'|',@cite)" select="tree/locations/loc">
>           <xsl:sort select="current-grouping-key()"/>
>           <include area="{current-group()[1]/@area}" cite="{current-group()[1]/@cite}"/>
>         </xsl:for-each-group>
>       </assembly>
>     </wkna-shared-cms>
>   </xsl:template>
> </xsl:stylesheet>
>



-- 
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
To avoid situations in which you might make mistakes may be the
biggest mistake of all
------------------------------------
Quality means doing it right when no one is looking.
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.
-------------------------------------
Typing monkeys will write all Shakespeare's works in 200yrs.Will they
write all patents, too? :)
-------------------------------------
I finally figured out the only reason to be alive is to enjoy it.


Current Thread
Keywords
xml