Need to tokenize words that are ALL CAPS

sderrick
Posts: 210

Sat Feb 28, 2015 1:33 am

I need to tokenize words that are ALL CAPS in certain text() nodes.

I can find them with the regex '\b(?=[A-Z])[A-Z ]+(?=\W)', except that XSLT doesn't support boundary assertions like \b (word boundary).

This regex will find the two word groups 'WE DOING' and 'BOB' in the string below.

How are WE DOING today suPER BOB.

Notice that it does not find 'How' or 'suPER', which is correct because I only want ALL CAPS words.

Does anybody know of another regex that works in analyze-string and does this without using \b?

thanks,

Scott

PS: Why the **** doesn't XSLT support standard regex identities that PHP, Java, UNIX, and every other regex engine supports? :(
Patrik
Posts: 227
Location: Hamburg/Germany

Re: Need to tokenize words that are ALL CAPS

Mon Mar 02, 2015 9:13 am

Hi Scott,

How about doing this in two steps:
1. Use analyze-string to identify words.
2. In matching-substring, check whether the word contains only caps.
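
The two steps above could be sketched roughly as follows. This is only an illustration under some assumptions: an XSLT 2.0 processor, a "word" taken to be a run of \w characters, and a placeholder element named caps that is not from this thread.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- priority="1" avoids a conflict with the identity template -->
  <xsl:template match="text()" priority="1">
    <!-- Step 1: split the text into words (assumption: a word is \w+) -->
    <xsl:analyze-string select="." regex="\w+">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- Step 2: keep only words made up entirely of A-Z.
               'test' is plain XPath, not an attribute value template,
               so no brace doubling would be needed here. -->
          <xsl:when test="matches(., '^[A-Z]+$')">
            <!-- 'caps' is a hypothetical wrapper element -->
            <caps><xsl:value-of select="."/></caps>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="."/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <!-- Whitespace and punctuation pass through unchanged -->
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Note that this wraps each word separately, so 'WE DOING' would come out as two caps elements rather than one group.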

An alternative might be this regexp: "(^|[\s])([A-Z\s]+)($|[\s])". With regex-group(2) you would get the words. However, the whitespace before/after them is part of the match, so it would NOT be accessible in non-matching-substring; it is only available as regex-group(1) and regex-group(3).

regards,

Patrik
sderrick
Posts: 210

Re: Need to tokenize words that are ALL CAPS

Tue Mar 03, 2015 2:07 am

I've decided to use

<xsl:analyze-string select="." regex="([A-Z]{{2,}})( [A-Z]{{2,}})*">
    <xsl:matching-substring>
        <xsl:element name="u">
            <xsl:value-of select="lower-case(.)"/>
        </xsl:element>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
        <xsl:value-of select="."/>
    </xsl:non-matching-substring>
</xsl:analyze-string>


which is not optimal because, without a word boundary, this will match inside a string like

fooBAR FOR SNAfu and grab "BAR FOR SNA"

But as mixed ALLCAP-lowercase words are very rare, it's probably OK.

Is there a way to require a word boundary character at the start and end of this without it being in the capture group?
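
One possible way to do this, sketched under assumptions rather than offered as a definitive answer: XPath 2.0 regular expressions have neither \b nor lookahead, so the boundary characters have to be matched, but they can be placed in their own groups and written back out unchanged. The sketch assumes an XSLT 2.0 processor and treats any character outside A-Z and a-z as a boundary; the group numbering is (1) leading boundary, (2) the ALL CAPS run, (4) trailing boundary.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- priority="1" avoids a conflict with the identity template -->
  <xsl:template match="text()" priority="1">
    <!-- 'regex' is an attribute value template, hence the doubled braces.
         Assumption: a boundary is start/end of string or any non-letter. -->
    <xsl:analyze-string select="."
        regex="(^|[^A-Za-z])([A-Z]{{2,}}( [A-Z]{{2,}})*)($|[^A-Za-z])">
      <xsl:matching-substring>
        <!-- Re-emit the leading boundary character (may be empty) -->
        <xsl:value-of select="regex-group(1)"/>
        <u>
          <xsl:value-of select="lower-case(regex-group(2))"/>
        </u>
        <!-- Re-emit the trailing boundary character (may be empty) -->
        <xsl:value-of select="regex-group(4)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this pattern, 'fooBAR FOR SNAfu' should only have 'FOR' wrapped. One limitation to be aware of: the trailing boundary character is consumed by the match, so two ALL CAPS runs separated by a single non-letter other than a space (for example 'AB,CD') would only have the first one matched.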
