
Re: [xsl] Unicode accented letters to simple ASCII equivalents


Subject: Re: [xsl] Unicode accented letters to simple ASCII equivalents
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 10 May 2013 20:03:52 +0200

On 10/05/2013, Liam R E Quin <liam@xxxxxx> wrote:
> On Fri, 2013-05-10 at 13:07 -0400, Jonina Dames wrote:
>> Hi everyone,
>>
>> I'm trying to re-map any accented letters from Unicode (in the XML) to
>> simple, un-accented ASCII equivalents (in an output text file).

A crude way of escaping the inability to handle diacritics.

>
> It depends on several things:
> (1) the normalization form of the input;
> (2) what you mean by un-accented ASCII equivalents.
> (3) whether you can call extension functions
>
> I'm not trying to split hairs - for example ü (u with two strokes over
> it, or two dots) should usually turn to ue without the accent,

Representing ä, ö, ü as ae, oe, ue is acceptable in German, but in
Hungarian, replacing an O, o, U or u that carries either a diaeresis or
a double acute with the vowel followed by an 'e' would be met with
incomprehension. In both languages, diacritics do not express emphasis
or length: they are used to denote an entirely different letter, with a
pronunciation that differs from that of the unadorned vowel.
Needless to say, evil changes in the meaning of the word would result,
sometimes even from harmless to four-letterish (szár - szar = stem -
sh*t).
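
To make the language dependence concrete, here is a minimal Java sketch
of a per-language replacement table. The class name, the method
signature and the table contents are assumptions made for illustration;
only the German ae/oe/ue convention mentioned above is encoded, and
text in any other language is deliberately passed through unchanged
rather than guessed at. It assumes precomposed (NFC) input.

```java
import java.util.HashMap;
import java.util.Map;

class UmlautTable {
    // Hypothetical table: the German convention only (a/o/u with diaeresis).
    private static final Map<Character, String> GERMAN = new HashMap<>();
    static {
        GERMAN.put('\u00E4', "ae"); // a with diaeresis
        GERMAN.put('\u00F6', "oe"); // o with diaeresis
        GERMAN.put('\u00FC', "ue"); // u with diaeresis
        GERMAN.put('\u00C4', "Ae"); // A with diaeresis
        GERMAN.put('\u00D6', "Oe"); // O with diaeresis
        GERMAN.put('\u00DC', "Ue"); // U with diaeresis
    }

    // Apply the table only when the text is declared to be German ("de");
    // any other language tag returns the input untouched.
    static String transliterate(String text, String lang) {
        if (!"de".equals(lang)) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = GERMAN.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, transliterate("M\u00FCller", "de") gives "Mueller",
while the same call with "hu" leaves the string alone. A real table
would need one entry set per language, agreed with the people who have
to read the output.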

> and if
> you need all-seven-bit output Æ can turn to Ae or AE, ß to ss,

These are not strictly letters with diacritical marks, but I suspect
that the original problem is either much simpler (and hasn't been
posted as such) or much more than the elimination of diacritical
marks.

> and so on.
>
> One way to handle this might be an XML file with specific characters in
> an attribute and replacements in content (say), and use key() in XSLT to
> look each character up in turn.
>
> Watch that é can be represented in two ways in Unicode - the precombined
> character or an e followed by an acute accent.

This is true for any letter with a diacritical mark - Normalization
Form Canonical Decomposition (NFD) separates the letter from its
"decorations".

> You can handle the
> multiple-character case in XSLT 2 with replace(), and some XSLT 1
> implementations might have a regular expression library available.
>
> Normalizing the input first into a fully-decomposed form and then using
> a regular expression, or using translate() to identify characters
> outside the ASCII range and skipping them, might be simplest.

Java provides java.text.Normalizer for transforming a text into the
NFD form, which would solve the first part. XQuery 1.0 and XPath 2.0
Functions and Operators has fn:normalize-unicode to achieve this
split.
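
A minimal sketch of that approach in Java, assuming that simply
dropping every combining mark is really what is wanted (the class and
method names are made up for illustration). It decomposes the text with
java.text.Normalizer and then removes all characters of the Unicode
category M (marks) with a regular expression:

```java
import java.text.Normalizer;

class StripDiacritics {
    // Decompose to NFD, then delete every combining mark (category M).
    static String strip(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("\u00E9"));    // precomposed e with acute
        System.out.println(strip("e\u0301"));   // e + combining acute accent
        System.out.println(strip("sz\u00E1r")); // Hungarian word from above
        System.out.println(strip("\u00DF"));    // sharp s: no decomposition
    }
}
```

Both spellings of the accented e come out as a plain "e", so the
normalization form of the input no longer matters. The third call
yields "szar", which is exactly the kind of accident described above,
and the sharp s passes through unchanged because NFD has nothing to
split off, so the output is still not guaranteed to be ASCII.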

>
> I won't speculate further because it really depends on the exact
> software and environment and on your purpose and exact problem.

Indeed.

Wolfgang

>
> Liam

