Need to tokenize words that are ALL CAPS

Post by **sderrick** » Sat Feb 28, 2015 1:33 am

I need to tokenize words that are ALL CAPS in certain text() nodes.

I can find them using the regex '\b(?=[A-Z])[A-Z ]+(?=\W)', except that XSLT doesn't support boundary assertions like \b (word boundary).

This regex will find the two word groups 'WE DOING' and 'BOB' in the string below.

How are WE DOING today suPER BOB.

Notice it will not find 'How' or 'suPER', which is correct because I only want ALL CAPS words.

Anybody know of another regex that will work in analyze-string that would do this without using \b?

thanks,

Scott

PS: Why the **** doesn't XSLT support the standard regex constructs that PHP, Java, UNIX, and every other regex engine support?

Post by **Patrik** » Mon Mar 02, 2015 9:13 am

Hi Scott,

How about doing this in two steps:
1. Use analyze-string to identify words.
2. In matching-substring, check whether the word contains only caps.
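For what it's worth, the two steps might look roughly like the sketch below. It rests on some assumptions of mine: a "word" is a run of ASCII letters, an ALL CAPS word needs at least two letters, and a match is wrapped in a lower-cased `<u>` element purely as placeholder output. The identity template and the `priority="1"` are only there so the stylesheet runs on its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything not handled elsewhere. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: every text node is processed; narrow the pattern as needed.
       The explicit priority avoids a conflict with the identity template. -->
  <xsl:template match="text()" priority="1">
    <!-- Step 1: split the text into words (runs of ASCII letters) and the rest. -->
    <xsl:analyze-string select="." regex="[A-Za-z]+">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- Step 2: keep only words made entirely of capitals.
               "test" is not an attribute value template, so the braces
               are not doubled here. -->
          <xsl:when test="matches(., '^[A-Z]{2,}$')">
            <u><xsl:value-of select="lower-case(.)"/></u>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="."/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <!-- Whitespace and punctuation pass through unchanged. -->
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Applied to `<p>How are WE DOING today suPER BOB.</p>`, this should give `<p>How are <u>we</u> <u>doing</u> today suPER <u>bob</u>.</p>`. Because every word is tested on its own, 'WE' and 'DOING' come out as two separate elements rather than one group, and a mixed word like 'suPER' is left alone.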

An alternative might be this regexp: "(^|[\s])([A-Z\s]+)($|[\s])". With regex-group(2) you would get the words. However, the whitespace before/after them would NOT be accessible in non-matching-substring.
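A rough sketch of the capture-group idea follows. Note that it uses a variant of the regex, not the exact one above: any non-letter (or the start or end of the string) counts as a boundary, and each word needs at least two capitals. Both choices are my assumptions, as is the `<u>` placeholder output. Since the boundary characters end up inside the match, they are written back with regex-group(1) and regex-group(4) in matching-substring.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything not handled elsewhere. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: every text node is processed; narrow the pattern as needed. -->
  <xsl:template match="text()" priority="1">
    <!-- Group 1: leading boundary (start of string or one non-letter).
         Group 2: the ALL CAPS words, separated by single spaces.
         Group 3: the last repeated " WORD" part (not used).
         Group 4: trailing boundary (end of string or one non-letter).
         "regex" is an attribute value template, so braces are doubled. -->
    <xsl:analyze-string select="."
        regex="(^|[^A-Za-z])([A-Z]{{2,}}( [A-Z]{{2,}})*)($|[^A-Za-z])">
      <xsl:matching-substring>
        <!-- Re-emit the consumed boundary characters around the element. -->
        <xsl:value-of select="regex-group(1)"/>
        <u><xsl:value-of select="lower-case(regex-group(2))"/></u>
        <xsl:value-of select="regex-group(4)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Applied to `<p>How are WE DOING today suPER BOB.</p>`, this should give `<p>How are <u>we doing</u> today suPER <u>bob</u>.</p>`. One known weakness: the trailing boundary character is consumed by the match, so it cannot also serve as the leading boundary of the next match. In 'FOO,BAR' or 'FOO-BAR' only 'FOO' would be picked up.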

regards,

Patrik

Post by **sderrick** » Tue Mar 03, 2015 2:07 am

I've decided to use

    <xsl:analyze-string select="." regex="([A-Z]{{2,}})( [A-Z]{{2,}})*">
        <xsl:matching-substring>
            <xsl:element name="u">
                <xsl:value-of select="lower-case(.)"/>
            </xsl:element>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>

which is not optimal, because without a word boundary it will match inside a string like

fooBAR FOR SNAfu

and grab "BAR FOR SNA".

But as mixed ALLCAP-lowercase words are very rare, it's probably OK.

Is there a way to require a word boundary character at the start and end of this without it being in the capture group?
