
Re: [xsl] special character encoding, two problems

Subject: Re: [xsl] special character encoding, two problems
From: "Jonina Dames jdames@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 23 Oct 2014 20:39:00 -0000

Thanks for the advice! The <xsl:value-of/> expression works for most of
the entities, but it's still missing a couple dozen characters. Some of
the author names still have Unicode entities instead of plain ASCII (for
example, several characters with a stroke, several ligatures, and thorn
characters, both upper- and lowercase). Is there a variation of this
function or a parameter that will catch and convert ALL of these to
plain ASCII, as well as the standard acute and cedilla characters? Or do
I need to address these outlying characters with something else (not
translate(), since I can't use a one-to-one replacement for ligature
entities)?
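
A note on why those characters survive, assuming the expression in
question is the normalize-unicode() one quoted below: letters with a
stroke (such as U+00F8), the ae/oe ligatures, thorn and sharp s have no
Unicode decomposition, so NFKD leaves them as single code points and
there is no nonspacing mark for replace() to delete. (Presentation-form
ligatures such as U+FB01 do decompose under NFKD.) One possible way to
cover the rest is to add explicit replacements after the mark-stripping
step. The following is only a sketch: the my:to-ascii function, its
namespace URI, the surname match pattern and the ASCII spellings chosen
(for example thorn to "th") are illustrative assumptions, not something
from this thread.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:to-ascii"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper, not part of any library. -->
  <xsl:function name="my:to-ascii" as="xs:string">
    <xsl:param name="text" as="xs:string?"/>

    <!-- 1. Decompose, then delete nonspacing marks (acute, cedilla, and so on). -->
    <xsl:variable name="no-marks" as="xs:string"
        select="replace(normalize-unicode($text, 'NFKD'), '\p{Mn}', '')"/>

    <!-- 2. One-to-many replacements: AE, ae, OE, oe, Thorn, thorn, sharp s. -->
    <xsl:variable name="expanded" as="xs:string"
        select="replace(replace(replace(replace(replace(replace(replace(
                  $no-marks,
                  '&#xC6;', 'AE'), '&#xE6;', 'ae'),
                  '&#x152;', 'OE'), '&#x153;', 'oe'),
                  '&#xDE;', 'Th'), '&#xFE;', 'th'),
                  '&#xDF;', 'ss')"/>

    <!-- 3. One-to-one replacements: O/o with slash, D/d with stroke,
            L/l with stroke, Eth/eth; then recompose whatever is left. -->
    <xsl:sequence
        select="normalize-unicode(
                  translate($expanded,
                    '&#xD8;&#xF8;&#x110;&#x111;&#x141;&#x142;&#xD0;&#xF0;',
                    'OoDdLlDd'),
                  'NFKC')"/>
  </xsl:function>

  <!-- Assumed element name; adjust the match pattern to the real input. -->
  <xsl:template match="surname">
    <surname><xsl:value-of select="my:to-ascii(.)"/></surname>
  </xsl:template>

</xsl:stylesheet>
```

The lists in steps 2 and 3 are deliberately short; they would need to be
extended to whichever characters actually occur in the author names.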


On Wed, Oct 15, 2014 at 05:56:40PM -0000, Jonina Dames jdames@xxxxxxxxx scripsit:

> Problem 2:
> I'm trying to use a stylesheet with a character map so I can convert
> accented letters to their plain ascii equivalents in a surname element of my
> output XML to create indexing values. I'm new to XSLT 2.0 and I'm having
> trouble figuring out the syntax so my mappings will work correctly. Is there
> a simpler way to convert numeric unicode entities of accented letters to
> plain ascii characters, or is this my best bet?

First off, the instant the XML document is parsed, all the numeric
entities ought to get converted straight to the character represented by
that entity, so when you're using XSLT to process the document you're
dealing with regular old characters of whatever code point.
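
A quick way to see this for yourself is to print code points. This is a
minimal sketch, assuming a hypothetical input document with surname
elements in which one name is written with a numeric reference and the
other with the literal precomposed character:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Assumed input:
       <names>
         <surname>Jos&#233;</surname>
         <surname>José</surname>
       </names>
  -->
  <xsl:template match="/">
    <xsl:for-each select="//surname">
      <!-- One line of decimal code points per surname. -->
      <xsl:value-of select="string-to-codepoints(.)" separator=" "/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Both surnames should come out as the same sequence, 74 111 115 233,
because the parser has already replaced the reference with U+00E9 before
the stylesheet sees the text.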

If what you want to do is to take the accented characters and remove
their accents, leaving the base character behind, the traditional way to
remove accents is the translate() function, thus -- translate(.,'é','e')
-- and you just keep going for all the characters you want to de-accent
when you create the surname element's string contents.

     <xsl:value-of select="translate(.,'ÀÂÇÉÈÊËÎÏÔÛÙÜŸÑàâçéèêëîïôûùüÿñ','AACEEEEIIOUUUYNaaceeeeiiouuuyn')"/>

This is pretty much like the character maps, except that character maps
aren't obliged to be one-to-one, which translate()'s replacement of
characters _is_.  (The first character in the search list gets swapped
for the first character in the replacement list, and so on.)
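
For comparison, here is roughly what a character map looks like in XSLT
2.0. It is a sketch only: the map name plain-ascii and the three
mappings are made up for illustration, and the identity template is
there just so the stylesheet produces some output to serialize.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The map is applied by the serializer, not during the transformation. -->
  <xsl:output method="xml" use-character-maps="plain-ascii"/>

  <xsl:character-map name="plain-ascii">
    <!-- One character in, any string out. -->
    <xsl:output-character character="&#xE9;" string="e"/>   <!-- e acute -->
    <xsl:output-character character="&#xE6;" string="ae"/>  <!-- ae ligature -->
    <xsl:output-character character="&#xDF;" string="ss"/>  <!-- sharp s -->
  </xsl:character-map>

  <!-- Identity transform: copy everything through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Two things to keep in mind: the map affects every text node and
attribute value in the serialized result, not just the surname element,
and it has no effect if the result tree is never serialized (for
example, when it is handed on as a tree to another process).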

If you've got XSLT 2.0, it's much better (because you don't have
to list every accented character that might show up!) to use

     <xsl:value-of select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')" />

because that will get everything without any explicit list requirement.

You're normalizing the Unicode string to decomposed form -- so the "e"
and the accent-aigu are separate code points -- then using Unicode
character categories to delete all the "Mark, Nonspacing" characters
(all the accents) via replace(), and then re-normalizing the result back
into the composed form which XSLT expects.

Both examples assume you're operating on the context node (that dot
character) which you might not be.
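
To make the three steps visible, and to show the expression applied to
something other than the context node, here is a small trace. It is a
sketch under assumptions: the sample parameter and its default value (a
single e-acute) are invented for the demonstration, and it relies on the
processor supporting the NFKD and NFKC forms.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Hypothetical test value: U+00E9, e with acute. -->
  <xsl:param name="sample" as="xs:string" select="'&#xE9;'"/>

  <xsl:template match="/">
    <!-- Step 1: split base letter and accent into separate code points. -->
    <xsl:variable name="decomposed" select="normalize-unicode($sample, 'NFKD')"/>
    <!-- Step 2: delete everything in category Mn (Mark, Nonspacing). -->
    <xsl:variable name="no-marks" select="replace($decomposed, '\p{Mn}', '')"/>
    <!-- Step 3: recompose whatever is left. -->
    <xsl:variable name="recomposed" select="normalize-unicode($no-marks, 'NFKC')"/>

    <!-- Print the code points after each step, one step per line. -->
    <xsl:value-of select="string-to-codepoints($decomposed)" separator=" "/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="string-to-codepoints($no-marks)" separator=" "/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="string-to-codepoints($recomposed)" separator=" "/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Run against any input document, the first line should read 101 769 (the
letter e followed by U+0301, the combining acute) and the other two
lines 101.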

-- Graydon


Jonina Dames
Customer Support Specialist
Inera Inc.
+1 617 932 1932
eXtyles on Twitter <https://twitter.com/extyles>

