Need to tokenize words that are ALL CAPS

Post by sderrick »

I need to tokenize words that are ALL CAPS in certain text() nodes.

I can find them using the regex '\b(?=[A-Z])[A-Z ]+(?=\W)', except that XSLT doesn't support boundary assertions like \b (word boundary).

This regex will find the two word groups 'WE DOING' and 'BOB' in the string below.

How are WE DOING today suPER BOB.

Notice it will not find, 'How' or 'suPER', which is correct because I only want ALL CAP words.

Anybody know of another regex that will work in analyze-string that would do this without using \b?

thanks,

Scott

PS: Why the **** doesn't XSLT support the standard regex constructs that PHP, Java, UNIX, and every other regex engine support? :(

Post by Patrik »

Hi Scott,

How about doing this in two steps:
1. Use analyze-string to identify words.
2. In matching-substring, check whether the word contains only caps.

An alternative might be this regexp: "(^|[\s])([A-Z\s]+)($|[\s])". With regex-group(2) you would get the words. However, the whitespace before and after them is consumed by the match, so it would NOT be accessible in non-matching-substring.

regards,

Patrik
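
A minimal sketch of the two-step idea above, not taken from the thread. It assumes an XSLT 2.0 processor, plain ASCII letters, and a hypothetical `u` wrapper element for the result; the identity template and the `priority` value are only there to make the stylesheet self-contained.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- priority="1" avoids a conflict with node() in the identity template. -->
  <xsl:template match="text()" priority="1">
    <!-- Step 1: split the text into words (runs of letters) and the rest. -->
    <xsl:analyze-string select="." regex="[A-Za-z]+">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- Step 2: a word counts only if every letter is a capital. -->
          <xsl:when test="matches(., '^[A-Z]+$')">
            <!-- "u" is an assumed element name for illustration. -->
            <u><xsl:value-of select="lower-case(.)"/></u>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="."/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <!-- Spaces and punctuation pass through untouched. -->
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Because the word itself is the whole match, no boundary test is needed: 'How' and 'suPER' fail the second check as complete words. Two trade-offs to be aware of: each word is wrapped separately, so 'WE DOING' becomes two elements rather than one group, and single capitals such as 'I' also pass unless the test is tightened to '^[A-Z]{2,}$'.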

Post by sderrick »

I've decided to use

<xsl:analyze-string select="." regex="([A-Z]{{2,}})( [A-Z]{{2,}})*">
    <xsl:matching-substring>
        <xsl:element name="u">
            <xsl:value-of select="lower-case(.)"/>
        </xsl:element>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
        <xsl:value-of select="."/>
    </xsl:non-matching-substring>
</xsl:analyze-string>

This is not optimal because, without a word boundary, it will match inside a string like

fooBAR FOR SNAfu

and grab "BAR FOR SNA".

But as mixed ALLCAP-lowercase words are very rare, it's probably OK.

Is there a way to require a word boundary character at the start and end of this without it being in the capture group?
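
One possible approach, offered as a sketch rather than a confirmed answer from the thread: the boundary characters do have to be matched, but they can sit in their own groups and be written back unchanged, so only the middle group ends up inside the element. The group numbering and the choice of "any non-letter" as the boundary are assumptions; the braces are doubled because the regex attribute is an attribute value template.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template so the sketch is self-contained. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="1">
    <!-- Group 1: start of string or one non-letter (leading boundary).
         Group 2: the ALL CAPS run, words separated by single spaces.
         Group 3: the repeated inner group, not used below.
         Group 4: one non-letter or end of string (trailing boundary). -->
    <xsl:analyze-string select="."
        regex="(^|[^A-Za-z])([A-Z]{{2,}}( [A-Z]{{2,}})*)([^A-Za-z]|$)">
      <xsl:matching-substring>
        <!-- Write the boundary characters back outside the element. -->
        <xsl:value-of select="regex-group(1)"/>
        <u><xsl:value-of select="lower-case(regex-group(2))"/></u>
        <xsl:value-of select="regex-group(4)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this pattern, 'fooBAR FOR SNAfu' should yield only 'FOR', since 'BAR' has a letter before it and 'SNA' has a letter after it. One known limitation: the trailing boundary character is consumed by the match, so it cannot also serve as the leading boundary of the next run. Two runs separated by a single non-letter other than a space (for example 'AB,CD') would have only the first one marked.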