
[xsl] XSLT Solution for hyphenation


Subject: [xsl] XSLT Solution for hyphenation
From: Jeff Sese <jsese@xxxxxxxxxxxx>
Date: Fri, 22 Dec 2006 14:09:50 +0800

Hi list,

I have a project that applies hyphenation to an XML document, using a list of words as a reference. The list can reach up to a million entries.
My XSLT solution is a template that matches text() nodes and inserts hyphens into any words that appear in the list. However, the transformation takes too long to finish, even for a relatively small file (around 1 MB). Is there any way to speed this up, or is there a better solution?


Here's my stylesheet:

<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<!-- Identity template: copy everything except text nodes unchanged -->
<xsl:template match="@*|element()|comment()|processing-instruction()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="text()">
  <xsl:variable name="str" select="."/>
  <!-- Words from the list that occur in this text node, with regex metacharacters escaped -->
  <xsl:variable name="searchStrs" as="xs:string*" select="$search-words[matches($str,.)]/replace(.,'[.\\?*+{}()\[\]\^\$&#x007C;]', '\\$0')"/>
  <!-- ati:replace-all is declared with two parameters, so it is called with two arguments -->
  <xsl:value-of select="ati:replace-all($str,$searchStrs)"/>
</xsl:template>

<!-- Recursively replace each remaining word with its hyphenated form, looked up through the 'replace' key -->
<xsl:function name="ati:replace-all">
  <xsl:param name="input" as="xs:string"/>
  <xsl:param name="words-to-replace" as="xs:string*"/>
  <xsl:sequence select="if (exists($words-to-replace)) then ati:replace-all(replace($input, $words-to-replace[1], key('replace',$words-to-replace[1],$search-words)),remove($words-to-replace,1)) else $input"/>
</xsl:function>


Here's a sample of the look-up table:

<root>
   <wordlist>
       <entry>
           <search>abaissassent</search>
           <replace>abais&#x00AD;sassent</replace>
       </entry>
       <entry>
           <search>abaissèrent</search>
           <replace>abais&#x00AD;sèrent</replace>
       </entry>
       <entry>
           <search>abandonnent</search>
           <replace>aban&#x00AD;donnent</replace>
       </entry>
   </wordlist>
</root>

So if I have "abaissassent" in a text() node, it will be replaced with "abais&#x00AD;sassent".
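
For comparison, here is a minimal sketch of a per-word lookup, as opposed to testing every list entry against every text node. It is only an illustration under stated assumptions: the look-up table is loaded from a separate file (the name wordlist.xml is hypothetical), the key name entry-by-search is made up for this example, and a "word" is taken to be any run of Unicode letters.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: the look-up table is a separate document; the file name is hypothetical -->
  <xsl:variable name="wordlist" select="document('wordlist.xml')"/>

  <!-- Index every entry by its search word so a lookup does not scan the whole list -->
  <xsl:key name="entry-by-search" match="entry" use="search"/>

  <!-- Identity template: copy everything except text nodes unchanged -->
  <xsl:template match="@*|element()|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <!-- Split the text into runs of letters (candidate words) and everything else.
         The braces are doubled because the regex attribute is an attribute value template. -->
    <xsl:analyze-string select="." regex="\p{{L}}+">
      <xsl:matching-substring>
        <!-- Look the word up in the table; $hit is empty when the word is not listed -->
        <xsl:variable name="hit" select="key('entry-by-search', ., $wordlist)[1]"/>
        <xsl:value-of select="if ($hit) then $hit/replace else ."/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Punctuation, digits and whitespace pass through untouched -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this approach each text node is tokenized once and each word costs one key lookup, so the work should grow with the size of the input rather than with the size of the word list. It is not equivalent to the stylesheet above in every case: it only replaces whole words, the match is case-sensitive, and words containing apostrophes or hyphens are split into separate tokens.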

--
*Jeff*

