Re: [xsl] tokenize


Subject: Re: [xsl] tokenize
From: "G. Ken Holman" <gkholman@xxxxxxxxxxxxxxxxxxxx>
Date: Fri, 14 Oct 2011 09:00:46 -0400

At 2011-10-14 13:51 +0100, Peter Flynn wrote:
It's either my brain slowing down, or the fact that it's nearly the
weekend, or my lack of sleep and coffee, but I can't understand this: I
need to break up the content of a td element which represents a Unix
filepath, tokenizing on slashes, and getting rid of bogus visual formatting:

  <xsl:template match="h:tbody/h:tr">
    <!-- tokenise the uri so that we only extract valid data, eg
         <td class="xl">
&#160;&#160;&#160;&#160;/researchprofiles/A015/pcrowley/</td>
    -->
    <xsl:variable name="uri">
      <xsl:value-of
           select="translate(h:td[@class='xl'],'&#160;&#xa;','')"/>
    </xsl:variable>

The above could simply be:


 <xsl:variable name="uri"
               select="translate(h:td[@class='xl'],'&#160;&#xa;','')"/>

... because your version creates a temporary tree (a root node containing a text node) when all you need is a string. Using select= on the <xsl:variable> gives you the string directly.
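
As a small sketch of the difference (the variable names $tree and $string and the sample value are mine, for illustration only), the following XSLT 2.0 stylesheet can be run against any input document. It should report "true" three times: the first variable holds a document node, the second holds a plain string, and tokenize() sees the same content either way:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="2.0">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- content constructor: builds a temporary tree (hypothetical name) -->
    <xsl:variable name="tree">
      <xsl:value-of select="'/a/b/'"/>
    </xsl:variable>
    <!-- select=: binds the string itself (hypothetical name) -->
    <xsl:variable name="string" select="'/a/b/'"/>

    <xsl:value-of select="$tree instance of document-node()"/>
    <xsl:text>&#xa;</xsl:text>
    <xsl:value-of select="$string instance of xs:string"/>
    <xsl:text>&#xa;</xsl:text>
    <!-- the tree is atomized to its string value before tokenizing -->
    <xsl:value-of select="count(tokenize($tree,'/')) =
                          count(tokenize($string,'/'))"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

So the tree form is only wasteful, not wrong, which is why it should not change what tokenize() returns.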

    <xsl:variable name="urifrag" select="tokenize($uri,'/')"/>
    <xsl:text>"</xsl:text>
    <xsl:value-of select="$urifrag[1]"/>
    <xsl:text>" </xsl:text>
    <xsl:text>&#xa;</xsl:text>
    ...
  </xsl:template>

(the commented example is Tidy'd output from the 'analog' web logfile
analyser). The result for the example td element is output as:

"/researchprofiles/A015/pcrowley"

That surprises me ... I would have expected "" because tokenize() returns an empty string as the first token when the input begins with the separator "/". If you look on pages 300 and 303 of my XSLT book here you will see that tokenize() returns the non-matching substring before the first match, even when that substring is empty:


http://www.CraneSoftwrights.com/training/#ptux

When covering this in the classroom, I have to point out the nuance of the first non-matching string. Here is an example from page 303:

tokenize(" a ","\s+") produces the three strings "", "a", ""

This is illustrated by doing the following with your string:

~/t/ftemp $ cat peter.xsl
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:template match="/">
    <xsl:variable name="in">
<xsl:value-of
          select="'&#160;&#xa;&#xa;&#160;/researchprofiles/A015/pcrowley/'"/>
    </xsl:variable>
    <xsl:variable name="uri" select="translate($in,'&#160;&#xa;','')"/>
    <xsl:variable name="urifrag" select="tokenize($uri,'/')"/>
Tokens:
<xsl:for-each select="$urifrag">
  <xsl:value-of select="concat('*',.,'* ')"/>
</xsl:for-each>
End
</xsl:template>
</xsl:stylesheet>
~/t/ftemp $ xslt2 peter.xsl peter.xsl
<?xml version="1.0" encoding="UTF-8"?>
Tokens:
** *researchprofiles* *A015* *pcrowley* **
End
~/t/ftemp $
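
If the empty first and last tokens are not wanted, one option (my addition for illustration, not something taken from your stylesheet) is to filter them out with a predicate. A minimal sketch, assuming the same sample path with the visual formatting already removed:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- sample value assumed for illustration -->
    <xsl:variable name="uri"
                  select="'/researchprofiles/A015/pcrowley/'"/>
    <!-- keep only the tokens that are not the empty string -->
    <xsl:variable name="urifrag"
                  select="tokenize($uri,'/')[. != '']"/>
    <xsl:for-each select="$urifrag">
      <xsl:value-of select="concat('*',.,'* ')"/>
    </xsl:for-each>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

With the predicate in place, $urifrag[1] is "researchprofiles" rather than the empty string.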

In other words, not only has it not tokenized the string, but something
has gobbled the trailing slash from the input content. I suspected that
there was some character encoding error (slashes except the final one
not being real slashes, perhaps) but they are all genuine.

You don't say which processor you are using ... I'm using Saxon above.


I have clearly misunderstood how tokenize works (except that I have been
using it perfectly happily elsewhere for years). The variable $urifrag
seems to be returning the entire string rather than breaking it up,
except for the trailing slash, which means it is actually splitting the
string on its final slash only, instead of on all slashes.

From that, I deduce that I have mis-expressed the variable or the
function, but it isn't apparent where or how.

I can't see it either, because I cannot reproduce your results. Even when I use the wasteful tree version of your variable I get the same five tokens shown above. Please try the above stylesheet in your environment and see if you get the same results.


I hope this helps.

. . . . . . . . . . . Ken

--
Contact us for world-wide XML consulting and instructor-led training
Crane Softwrights Ltd.            http://www.CraneSoftwrights.com/s/
G. Ken Holman                   mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Google+ profile: https://plus.google.com/116832879756988317389/about
Legal business disclaimers:    http://www.CraneSoftwrights.com/legal

