Re: [xsl] special character encoding, two problems

From: "Graydon graydon@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 15 Oct 2014 18:23:16 -0000

On Wed, Oct 15, 2014 at 05:56:40PM -0000, Jonina Dames jdames@xxxxxxxxx scripsit:
> Problem 2:
> I'm trying to use a stylesheet with a character map so I can convert
> accented letters to their plain ascii equivalents in a surname element of my
> output XML to create indexing values. I'm new to XSLT 2.0 and I'm having
> trouble figuring out the syntax so my mappings will work correctly. Is there
> a simpler way to convert numeric unicode entities of accented letters to
> plain ascii characters, or is this my best bet?

First off, the instant the XML document is parsed, every numeric
character reference gets converted straight to the character it
represents, so when you're using XSLT to process the document you're
dealing with regular old characters of whatever code point.
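
As a small illustration of that point, here is a hypothetical input
fragment (the element names and the name itself are made up). Once
parsed, the two surname elements hold exactly the same four characters,
so the stylesheet cannot tell which one was written as a reference.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input: both surnames end in U+00E9 after parsing. -->
<names>
  <surname>Ren&#233;</surname>
  <surname>René</surname>
</names>
```

In XSLT 2.0, string-to-codepoints() applied to either element returns
82 101 110 233.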

If what you want to do is to take the accented characters and remove
their accents, leaving the base character behind, the traditional way to
remove accents is the translate() function, thus -- translate(.,'é','e')
-- and you just keep going for all the characters you want to de-accent
when you create the surname element's string contents.

    <xsl:value-of select="translate(.,'ÀÂÇÉÈÊËÎÏÔÛÙÜŸÑàâçéèêëîïôûùüÿñ','AACEEEEIIOUUUYNaaceeeeiiouuuyn')"/>

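This is pretty much like the character maps, except that a character map
can replace one character with a string of any length, whereas
translate() only swaps single characters for single characters.  (The
first character in the search list gets swapped for the first character
in the replacement list, and so on.)  A character map is also applied to
the whole result as it is serialized, so it can't be limited to just the
surname element.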
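
To put the translate() call in context, here is a minimal sketch of a
complete XSLT 1.0 stylesheet. It assumes the element is called surname
(as in the question) and that everything else should be copied through
unchanged. The variable names accented and plain are made up for the
example, and the lists are cut short to keep it readable.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed names: the search and replacement lists, kept in one place.
       Character N of $accented is replaced by character N of $plain. -->
  <xsl:variable name="accented" select="'ÀÂÇÉàâçé'"/>
  <xsl:variable name="plain"    select="'AACEaace'"/>

  <!-- Identity template: copy everything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumed element name: replace the content of each surname
       with its de-accented string value. -->
  <xsl:template match="surname">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:value-of select="translate(., $accented, $plain)"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the two lists in variables makes it easier to check that they
stay the same length. If the replacement list is shorter, translate()
deletes the search characters that have no counterpart.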

If you've got XSLT 2.0, it's much better (because you don't have
to list every accented character that might show up!) to use

    <xsl:value-of select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')" />

because that will get everything without any explicit list requirement.  

You're normalizing the Unicode string to decomposed form -- so the "e"
and the accent-aigu are separate code points -- then using Unicode
character categories to delete all the "Mark, Nonspacing" characters
(all the accents) via replace(), and then re-normalizing the result back
into the composed form which XSLT expects.
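
The same three steps can be spelled out one at a time, which may be
easier to debug. This is only a sketch: the variable names are made up,
and it handles nothing but an element assumed to be called surname
(other nodes fall through to the built-in template rules).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="surname">
    <!-- Step 1: decompose, so "é" becomes "e" followed by U+0301. -->
    <xsl:variable name="decomposed"
        select="normalize-unicode(., 'NFKD')"/>

    <!-- Step 2: delete every character in category Mn
         (Mark, Nonspacing), which is where the accents now live. -->
    <xsl:variable name="stripped"
        select="replace($decomposed, '\p{Mn}', '')"/>

    <!-- Step 3: recompose whatever is left and write it out. -->
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="normalize-unicode($stripped, 'NFKC')"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Two things to watch for: letters such as "ø" and "ł" have no
decomposition, so they pass through untouched, and the "K" forms also
unfold compatibility characters such as the "ﬁ" ligature. If either
matters for your index, you would still need a translate() call or a
lookup for those cases.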

Both examples assume you're operating on the context node (that dot
character) which you might not be.
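
If the string you want isn't the context node, one option in XSLT 2.0
is to wrap the expression in a stylesheet function and pass it whatever
you like. A minimal sketch, in which the function name, its namespace
URI, and the author and index-key elements are all hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:local-functions"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: takes the text as a parameter instead of
       relying on the context node. An empty argument yields ''. -->
  <xsl:function name="my:strip-accents" as="xs:string">
    <xsl:param name="text" as="xs:string?"/>
    <xsl:sequence select="normalize-unicode(
        replace(normalize-unicode($text, 'NFKD'), '\p{Mn}', ''),
        'NFKC')"/>
  </xsl:function>

  <!-- Hypothetical usage: the context node is author, but the value
       comes from its surname child (assumed to occur at most once). -->
  <xsl:template match="author">
    <index-key>
      <xsl:value-of select="my:strip-accents(surname)"/>
    </index-key>
  </xsl:template>

</xsl:stylesheet>
```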

-- Graydon
