
Re: [xsl] deduplicating information in XML files


Subject: Re: [xsl] deduplicating information in XML files
From: "G. Ken Holman" <gkholman@xxxxxxxxxxxxxxxxxxxx>
Date: Fri, 12 Oct 2012 10:20:41 -0400

At 2012-10-12 14:02 +0200, Robby Pelssers wrote:
Hi all,

This time I have a rather challenging task at hand. Let me first describe the use case. We have lots of product information stored in XML. Some of that information describes:
- Technical applications
- Features and benefits
- Technical summary


One of the problems is that a lot of products have, e.g., the same features and benefits because they belong to the same product family or group. But as we stored that info per product, it got duplicated. Now we want to deduplicate that info by generating DITA maps and topics (both are just XML).

For simplicity, let's assume we generate the following content for product1 and product2. The goal is to get from INPUT to OUTPUT by checking whether the bodies of the linked topics are duplicates, then creating one generic topic and rewriting the links in the maps to point to that single topic.

I have XSLT / XQuery (XMLDB) and Java at my disposal, and I'm not sure which will be the easiest way to get the job done. Keep in mind also that my INPUT will contain a few thousand files (maps and linked topics) and I will need to deduplicate the whole set ;-)

Thx upfront for any input,
Robby

INPUT

Product1_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
</map>

Product1_FandB.xml:
<content>
<meta>
<id>product1</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
</body>
</content>


Product2_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
</map>

Product2_FandB.xml:
<content>
<meta>
<id>product2</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
</body>
</content>


Expected output:

Product1_map.xml
<map>
  <features-benefits-ref href="features-benefits/FandB_1.xml"/>
</map>

Product2_map.xml
<map>
  <features-benefits-ref href="features-benefits/FandB_1.xml"/>
</map>

FandB_1.xml:
<content>
<meta>
<id><!-- can become empty --></id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
</body>
</content>

I hope the complete solution below in XSLT is helpful. I see that Wendell posted while I was working on this, and I like his idea of using the collection() function rather than my hardwired map of maps. I'll leave that with you as an exercise. You can also tweak the file name generation as you need. Oh, and I also added some additional data.
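
For anyone picking up that exercise, here is a minimal sketch of the collection() idea. It only builds the map of maps from a directory listing, so it could stand in for a hand-written list. The assumptions are mine and not taken from Wendell's post: what collection() accepts is implementation-defined, and the "?select=" directory syntax used here is the Saxon convention; the $map-dir parameter, the "main" template name and the file: URI are hypothetical and only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output indent="yes"/>

  <!-- hypothetical parameter: directory holding the *_map.xml files -->
  <xsl:param name="map-dir" select="'file:///t:/ftemp/robby'"/>

  <!-- assumption: Saxon's directory-collection syntax; other processors
       interpret the collection() argument differently -->
  <xsl:variable name="map-docs"
                select="collection(concat($map-dir, '?select=*_map.xml'))"/>

  <!-- hypothetical entry point, to be invoked as the initial template -->
  <xsl:template name="main">
    <maps>
      <xsl:for-each select="$map-docs">
        <xsl:sort select="string(base-uri(.))"/>
        <!-- keep only the file name, as in a hand-written map of maps -->
        <map href="{tokenize(string(base-uri(.)), '/')[last()]}"/>
      </xsl:for-each>
    </maps>
  </xsl:template>

</xsl:stylesheet>
```

The result has the same shape as a hand-maintained map of maps, so it can be fed to the grouping stylesheet unchanged. Note that the sort is a plain string sort, so Product10 would come before Product2.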


I was really curious about this solution. In the classroom I teach the three methods of grouping in XSLT 1: by axes, by keys and by variables. When I talk about XSLT 2 I claim (or used to claim!) that these methods were no longer needed. But ... I had to use the variable method in XSLT 2 in order to solve your requirement! So I'll have to change my classroom materials to reflect this.

The reason I had to use the variable-based grouping method is that the group-by= attribute of XSLT 2's <xsl:for-each-group> groups on the atomized value of the grouping key, not on the structure of the nodes. I had to use deep-equal() to determine whether two bodies have the same structure, and that ruled out <xsl:for-each-group>. So I turned to the XSLT 1 variable-based method, which works across documents with an arbitrary test of equality, knowing that the shape of the solution would give me what I wanted.
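
Stripped of the file writing, the variable-based idiom with deep-equal() can be sketched as follows. This is an illustration under assumptions, not the stylesheet posted in this message: the f: namespace URI and the f:body() helper are hypothetical names introduced only to avoid repeating the nested doc() calls, and the output vocabulary (groups/group) is made up.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:f="urn:x-example:dedup"
                exclude-result-prefixes="f"
                version="2.0">

  <xsl:output indent="yes"/>

  <!-- hypothetical helper: the body of the topic that a map-of-maps
       entry leads to (entry -> map file -> topic file -> body) -->
  <xsl:function name="f:body" as="element(body)*">
    <xsl:param name="map" as="element(map)"/>
    <xsl:sequence
      select="doc(doc($map/@href)/*/features-benefits-ref/@href)/*/body"/>
  </xsl:function>

  <xsl:template match="maps">
    <xsl:variable name="maps" select="map"/>
    <groups>
      <xsl:for-each select="$maps">
        <!-- every entry whose body is structurally equal to this one's;
             the current entry is always a member of its own group -->
        <xsl:variable name="group"
          select="$maps[deep-equal(f:body(.), f:body(current()))]"/>
        <!-- act only when the current entry is the first of its group -->
        <xsl:if test=". is $group[1]">
          <group size="{count($group)}">
            <xsl:copy-of select="$group"/>
          </group>
        </xsl:if>
      </xsl:for-each>
    </groups>
  </xsl:template>

</xsl:stylesheet>
```

Two things to keep in mind with this shape. deep-equal() ignores comments and processing instructions but does compare text nodes, so bodies that differ only in whitespace will not group together. And every entry is compared with every other entry, so the number of comparisons grows with the square of the number of maps, which matters for a few thousand files.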

I think this is directly translatable to XQuery, and so I will post such a solution to that list.

Good luck!

. . . . . . . . Ken

t:\ftemp\robby>type robby.xml
<?xml version="1.0" encoding="UTF-8"?>
<maps>
  <map href="Product1_map.xml"/>
  <map href="Product2_map.xml"/>
  <map href="Product3_map.xml"/>
  <map href="Product4_map.xml"/>
  <map href="Product5_map.xml"/>
</maps>

t:\ftemp\robby>type Product1_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
</map>

t:\ftemp\robby>type Product2_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
</map>

t:\ftemp\robby>type Product3_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product3_FandB.xml"/>
</map>

t:\ftemp\robby>type Product4_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product4_FandB.xml"/>
</map>

t:\ftemp\robby>type Product5_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product5_FandB.xml"/>
</map>

t:\ftemp\robby>dir /s features-benefits
 Volume in drive T is VBOX_t
 Volume Serial Number is 0E00-0002

Directory of t:\ftemp\robby\features-benefits

2012-10-12  08:37               235 Product1_FandB.xml
2012-10-12  08:37               235 Product2_FandB.xml
2012-10-12  08:38               286 Product3_FandB.xml
2012-10-12  08:38               285 Product4_FandB.xml
2012-10-12  08:38               285 Product5_FandB.xml
               5 File(s)          1,326 bytes

     Total Files Listed:
               5 File(s)          1,326 bytes
               0 Dir(s)  16,795,488,256 bytes free

t:\ftemp\robby>type features-benefits\Product1_FandB.xml
<content>
<meta>
<id>product1</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
</body>
</content>


t:\ftemp\robby>type features-benefits\Product2_FandB.xml
<content>
<meta>
<id>product2</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
</body>
</content>


t:\ftemp\robby>type features-benefits\Product3_FandB.xml
<content>
<meta>
<id>product3</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
<p>With additional text that is different</p>
</body>
</content>


t:\ftemp\robby>type features-benefits\Product4_FandB.xml
<content>
<meta>
<id>product4</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
<p>With additional text that is the same</p>
</body>
</content>


t:\ftemp\robby>type features-benefits\Product5_FandB.xml
<content>
<meta>
<id>product5</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
<p>With additional text that is the same</p>
</body>
</content>


t:\ftemp\robby>call xslt2 robby.xml robby.xsl out\robbyout.xml

t:\ftemp\robby>dir \s out
 Volume in drive T is VBOX_t
 Volume Serial Number is 0E00-0002

Directory of t:\


Directory of t:\ftemp\robby\out


2012-10-12  10:02    <DIR>          features-benefits
2012-10-12  10:14                94 Product1_map.xml
2012-10-12  10:14                94 Product2_map.xml
2012-10-12  10:14                84 Product3_map.xml
2012-10-12  10:14                94 Product4_map.xml
2012-10-12  10:14                94 Product5_map.xml
2012-10-12  10:14               371 robbyout.xml
               6 File(s)          1,001 bytes
               1 Dir(s)  16,795,488,256 bytes free

t:\ftemp\robby>type out\robbyout.xml
<?xml version="1.0" encoding="UTF-8"?>
<maps><!--features-benefits/Product1_FandB.xml.group.xml-->
<map href="Product1_map.xml"/>
<map href="Product2_map.xml"/>
<!--features-benefits/Product3_FandB.xml-->
<map href="Product3_map.xml"/>
<!--features-benefits/Product4_FandB.xml.group.xml-->
<map href="Product4_map.xml"/>
<map href="Product5_map.xml"/>
</maps>
t:\ftemp\robby>type out\Product1_map.xml
<map>
<features-benefits-ref href="features-benefits/Product1_FandB.xml.group.xml"/>
</map>
t:\ftemp\robby>type out\Product2_map.xml
<map>
<features-benefits-ref href="features-benefits/Product1_FandB.xml.group.xml"/>
</map>
t:\ftemp\robby>type out\Product3_map.xml
<map>
<features-benefits-ref href="features-benefits/Product3_FandB.xml"/>
</map>
t:\ftemp\robby>type out\Product4_map.xml
<map>
<features-benefits-ref href="features-benefits/Product4_FandB.xml.group.xml"/>
</map>
t:\ftemp\robby>type out\Product5_map.xml
<map>
<features-benefits-ref href="features-benefits/Product4_FandB.xml.group.xml"/>
</map>
t:\ftemp\robby>type out\features-benefits\Product1_FandB.xml.group.xml
<content>
<meta>
<id>
<!-- - features-benefits/Product1_FandB.xml-->
<!-- - features-benefits/Product2_FandB.xml-->
</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
</body>
</content>
t:\ftemp\robby>type out\features-benefits\Product3_FandB.xml
<content>
<meta>
<id/>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
<p>With additional text that is different</p>
</body>
</content>
t:\ftemp\robby>type out\features-benefits\Product4_FandB.xml.group.xml
<content>
<meta>
<id>
<!-- - features-benefits/Product4_FandB.xml-->
<!-- - features-benefits/Product5_FandB.xml-->
</id>
</meta>
<body>
<p>Suitable for high frequency applications due to fast switching characteristics</p>
<p>Suitable for logic level gate drive sources</p>
<p>With additional text that is the same</p>
</body>
</content>
t:\ftemp\robby>type robby.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:template match="maps">
  <xsl:variable name="maps" select="map"/>
  <!--walk across all maps, acting on the first one that has unique content-->
  <maps>
    <xsl:for-each select="$maps">
      <xsl:variable name="map-href" select="@href"/>
      <!-- <xsl:message select="$map-href"/>
      <xsl:message select="generate-id(doc(doc(@href)/*/features-benefits-ref/@href))"/>
      <xsl:message select="count(
        $maps[deep-equal(doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
                 doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)])"/>
      <xsl:message select="
        $maps[deep-equal(doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
                 doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)]/generate-id(.)"/>
      -->
      <xsl:if test="generate-id(.)=generate-id
          ($maps[deep-equal(doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
                   doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)][1])">
        <!--found the first one of the group with this body content-->
        <xsl:variable name="current-group" select="$maps[
            deep-equal(doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
                 doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)]"/>
        <xsl:variable name="count-current-group"
                      select="count($current-group)"/>
        <xsl:variable name="new-file-href"
                      select="concat(doc($map-href)/*/features-benefits-ref/@href,
                                     if( $count-current-group=1 )
                                     then '' else '.group.xml' )"/>
        <!--just for information, note this in the result map of maps-->
        <xsl:comment select="$new-file-href"/><xsl:text>&#xa;</xsl:text>
        <xsl:for-each select="$current-group">
          <!--reference the map file-->
          <map href="{@href}"/>
          <!--recreate the map file-->
          <xsl:result-document href="{@href}" omit-xml-declaration="yes">
            <map>
              <features-benefits-ref href="{$new-file-href}"/>
            </map>
          </xsl:result-document>
        </xsl:for-each>
        <!--recreate the content file-->
        <xsl:result-document href="{$new-file-href}"
                             omit-xml-declaration="yes">
          <content>
            <meta>
              <id>
                <xsl:choose>
                  <xsl:when test="$count-current-group=1">
                    <!--the context node is the empty map entry, so nothing
                        is copied and the id comes out empty-->
                    <xsl:copy-of select="node()"/>
                  </xsl:when>
                  <xsl:otherwise>
                    <xsl:for-each select="$current-group">
                      <xsl:text>&#xa;</xsl:text>
                      <xsl:comment select="string(.),
                               '-',doc(@href)/*/features-benefits-ref/@href"/>
                    </xsl:for-each>
                    <xsl:text>&#xa;</xsl:text>
                  </xsl:otherwise>
                </xsl:choose>
              </id>
            </meta>
            <xsl:copy-of
              select="doc(doc(@href)/*/features-benefits-ref/@href)/*/body"/>
          </content>
        </xsl:result-document>
      </xsl:if>
    </xsl:for-each>
  </maps>
</xsl:template>

</xsl:stylesheet>

--
Contact us for world-wide XML consulting and instructor-led training
Free 5-hour lecture: http://www.CraneSoftwrights.com/links/udemy.htm
Crane Softwrights Ltd.            http://www.CraneSoftwrights.com/s/
G. Ken Holman                   mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Google+ profile: https://plus.google.com/116832879756988317389/about
Legal business disclaimers:    http://www.CraneSoftwrights.com/legal

