[xsl] Re: text() word lists


Subject: [xsl] Re: text() word lists
From: "Dimitre Novatchev" <dnovatchev@xxxxxxxxx>
Date: Sat, 7 Feb 2004 08:59:52 +0100

Hi David,

I also thought about this approach, but decided it would create new problems
in addition to the problem it attempts to solve.

Obviously, it may not work for languages other than English, and even for
English some of the generated element names may not be syntactically
correct.

Some correct words may be lost, e.g. words that contain a hyphen, or that
start with, end with, or contain an apostrophe.

I am not a specialist in English, but it seems to me that there would be
quite a number of special cases, and covering them all would lead to a code
explosion.

Of course, these problems will partially exist for the standard
tokenization approach, too.
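
To illustrate the element-name problem: the stylesheet quoted below does not
list "/" or "&" among its non-word characters, so a token such as "and/or"
would be passed to xsl:element as a name and rejected. One way around this,
sketched here only as an assumption and not as part of David's solution, is
to keep each word as the text of a fixed wrapper element. The element names
"words" and "w", the variable names, and the short separator list are all
hypothetical choices for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <!-- Assumed separator set (14 characters). Hyphen and apostrophe are
       deliberately left out so that words like "don't" survive. -->
  <xsl:variable name="separators">.,:;!?"()[]{}/</xsl:variable>
  <!-- At least one space per separator; translate() ignores the extras. -->
  <xsl:variable name="padding" select="'                    '"/>

  <xsl:template match="/">
    <words>
      <xsl:apply-templates select="//text()"/>
    </words>
  </xsl:template>

  <!-- Lower-case the text, turn separators into spaces, collapse runs. -->
  <xsl:template match="text()">
    <xsl:call-template name="wrap-words">
      <xsl:with-param name="text"
          select="normalize-space(translate(.,
                  concat($upper, $separators),
                  concat($lower, $padding)))"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Emit one <w> per space-separated token; the word is element
       content, so it never has to be a legal XML name. -->
  <xsl:template name="wrap-words">
    <xsl:param name="text"/>
    <xsl:choose>
      <xsl:when test="contains($text, ' ')">
        <w><xsl:value-of select="substring-before($text, ' ')"/></w>
        <xsl:call-template name="wrap-words">
          <xsl:with-param name="text"
              select="substring-after($text, ' ')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:when test="$text != ''">
        <w><xsl:value-of select="$text"/></w>
      </xsl:when>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Later passes would then compare string values (for example
count(../w[. = current()])) instead of element names. This removes the
naming restriction, but deciding what counts as a word in a given language
remains as hard as before.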


Cheers,

Dimitre Novatchev,
FXSL developer,

http://fxsl.sourceforge.net/ -- the home of FXSL
Resume: http://fxsl.sf.net/DNovatchev/Resume/Res.html


"McNally, David" <David.McNally@xxxxxxxxxx> wrote in message
news:0E004801DA49714E942CE230B535762606EC8E@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> You can do this in XSLT 1.0, with the node-set() extension function, by
> doing multiple passes on the file.  Basically, in the first pass, you get
> rid of all the
> elements, and then turn all of the words in the document into empty
> elements.  You can then manipulate the words as elements, and there are
> probably any number of ways to get to your final result.  Here, I'm doing
> another pass just to add on count attributes to each element, and then a
> final pass to output the results, first sorted alphabetically, then by
> count.
>
> It seems to work, though I haven't done enough testing to be sure that it
> doesn't quietly mess things up.  Also, I'm not sure how well it's going to
> work on big files - but in that situation you probably should think about
> using perl or something.
>
> Hope this helps,
> David.
>
>
>
> Text_frequency_count.xslt:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet version="1.0"
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> xmlns:ext="urn:schemas-microsoft-com:xslt" xmlns:rep="http://whatever.com"
> xmlns:saxon="http://icl.com/saxon">
> <xsl:output method="text" version="1.0" encoding="UTF-8"
> indent="yes"/>
>
> <xsl:variable name="nonwordchars"><xsl:text>.,:;!?"'()[]&lt;>{}@#$%^*-_+=|\~</xsl:text></xsl:variable>
> <xsl:variable name="lletters" select="'abcdefghijklmnopqrstuvwxyz'"/>
> <xsl:variable name="Uletters" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
> <!-- quoted so that this is a string of digits rather than a number -->
> <xsl:variable name="numbers" select="'1234567890'"/>
>
>
> <xsl:template match="/">
>
> <xsl:variable name="document">
> <bagOfElements>
> <xsl:apply-templates select="*" mode="firstrun"/>
> </bagOfElements>
> </xsl:variable>
>
> <xsl:variable name="document2">
> <bagOfElements>
> <xsl:apply-templates select="ext:node-set($document)"
> mode="secondrun"/>
> </bagOfElements>
> </xsl:variable>
>
> <xsl:apply-templates select="ext:node-set($document2)/*"
> mode="finalrun"/>
>
> </xsl:template>
>
>
> <!-- FIRST RUN - get rid of all the elements, and turn words into elements
> -->
>
> <xsl:template match="*" mode="firstrun">
> <xsl:apply-templates mode="firstrun"/>
> </xsl:template>
>
> <xsl:template match="text()" mode="firstrun">
>
> <!-- Anything that is not a letter gets translated into a space.  The
> string of spaces below must supply at least one space per non-letter
> character; translate() ignores any extras. -->
> <xsl:variable name="text"
> select="normalize-space(translate(.,
> concat($Uletters,$numbers,$nonwordchars,'&#9;','&#10;'),
> concat($lletters,'                                                  ')))"/>
>
> <xsl:call-template name="elementify">
> <xsl:with-param name="text" select="$text"/>
> </xsl:call-template>
>
> </xsl:template>
>
> <xsl:template name="elementify">
> <xsl:param name="text"/>
>
> <xsl:choose>
> <xsl:when test="contains($text,' ')">
> <xsl:element name="{substring-before($text,' ')}"/>
> <xsl:call-template name="elementify">
> <xsl:with-param name="text"
> select="substring-after($text, ' ')"/>
> </xsl:call-template>
> </xsl:when>
> <xsl:otherwise>
> <xsl:if test="string-length($text) > 0">
> <xsl:element name="{$text}"/>
> </xsl:if>
> </xsl:otherwise>
> </xsl:choose>
>
> </xsl:template>
>
> <!-- SECOND RUN - just adding count attributes to make sorting easier in
> final run. -->
>
> <xsl:template match="bagOfElements" mode="secondrun">
> <xsl:for-each select="*">
> <xsl:element name="{name()}">
> <xsl:attribute name="count">
> <xsl:value-of
> select="count(/bagOfElements/*[name() = name(current())])"/>
> </xsl:attribute>
> </xsl:element>
> </xsl:for-each>
> </xsl:template>
>
>
> <!-- THE FINAL RUN -->
>
> <xsl:template match="bagOfElements" mode="finalrun">
> <xsl:text>
> Ordered By Name
>
> </xsl:text>
> <xsl:apply-templates  select="*" mode="finalrun">
> <xsl:sort select="name()"></xsl:sort>
> </xsl:apply-templates>
> <xsl:text>
> Ordered By Count
>
> </xsl:text>
> <xsl:apply-templates  select="*" mode="finalrun">
> <xsl:sort select="@count" data-type="number"></xsl:sort>
> </xsl:apply-templates>
>
> </xsl:template>
>
>
> <xsl:template match="*" mode="finalrun">
> <xsl:variable name="currentname" select="name()"/>
> <xsl:if test="not(preceding-sibling::*[name() = $currentname])">
> <xsl:value-of select="name()"/>
> <xsl:text>: </xsl:text>
> <xsl:value-of select="@count"/>
> <xsl:text>&#10;</xsl:text>
> </xsl:if>
> </xsl:template>
>
> </xsl:stylesheet>
>
>
> File:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <?xml-stylesheet type="text/xsl"
> href="C:\Work\xsl\text_frequency_count.xslt"?>
> <foo>
> <blort> This is a <wibble>Test</wibble>, only a test!</blort>
> <blort> This really is a <wibble>great big test</wibble>,
>  only a test!</blort>
> </foo>
>
>
> Output:
>
>
> Ordered By Name
>
> a: 4
> big: 1
> great: 1
> is: 2
> only: 2
> really: 1
> test: 4
> this: 2
>
> Ordered By Count
>
> really: 1
> great: 1
> big: 1
> this: 2
> is: 2
> only: 2
> a: 4
> test: 4
>
> > -----Original Message-----
> > From: James Cummings [mailto:James.Cummings@xxxxxxxxxxxxxx]
> > Sent: Friday, February 06, 2004 10:35 AM
> > To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> > Subject: [xsl] text() word lists
> >
> >
> >
> > Hi there,
> >
> > I'm sure this is a faq, and I've checked the faq and archive.
> > I swear I remember someone asking about it, but I couldn't
> > find it, so here goes.
> >
> > I want to take an XML file of unknown elements and create
> > a word frequency list / word list.  Now, an entry on sorting
> > in the xslt faq says this is just what xslt is bad at.  (And
> > I'm sure there are some that would say 'just go use perl',
> > but let's say I want to do it in xslt, 1 or 2.)
> >
> > XSLT2 makes the tokenization of strings much easier, so
> > assuming I'm using that, if I have:
> >
> > <foo>
> > <blort> This is a <wibble>Test</wibble>, only a test!</blort>
> > <blort> This really is a <wibble>great big test</wibble>,
> > only a test!</blort> </foo>
> >
> > I don't know that foo|wibble|blort  will be the element names.
> >
> > But I want to produce both:
> >
> > a  -- 4
> > test  -- 4
> > is  -- 2
> > only -- 2
> > this  -- 2
> > big -- 1
> > great -- 1
> > really -- 1
> >
> > Which (unless I've missed something) should be
> > a case-insensitive list grouped by frequency
> > sorted alphabetically within this, and ignoring
> > punctuation.
> >
> > But also:
> >
> > a  -- 4
> > big -- 1
> > great -- 1
> > is  -- 2
> > only -- 2
> > really -- 1
> > test  -- 4
> > this  -- 2
> >
> > Which is the same list but not grouped
> > by frequency.
> >
> > Suggestions? Solutions?
> >
> > Many thanks for any help,
> > -James
> > ---
> > Dr James Cummings, Oxford Text Archive, University of Oxford
> > James.Cummings at ota.ahds.ac.uk http://users.ox.ac.uk/~jamesc/
> >
> >  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
> >