Re: [xsl] Unicode accented letters to simple ASCII equivalents
Subject: Re: [xsl] Unicode accented letters to simple ASCII equivalents
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 10 May 2013 20:03:52 +0200
On 10/05/2013, Liam R E Quin <liam@xxxxxx> wrote:
> On Fri, 2013-05-10 at 13:07 -0400, Jonina Dames wrote:
>> Hi everyone,
>>
>> I'm trying to re-map any accented letters from Unicode (in the XML) to
>> simple, un-accented ASCII equivalents (in an output text file).

A crude escape of not being able to handle diacritics.

> It depends on several things:
> (1) the normalization form of the input;
> (2) what you mean by un-accented ASCII equivalents.
> (3) whether you can call extension functions
>
> I'm not trying to split hairs - for example ü (u with two strokes over
> it, or two dots) should usually turn to ue without the accent,

ä, ö, ü represented as ae, oe, ue is acceptable in German, but in
Hungarian, replacing O, o, U or u with either diaeresis or double
acute by the vowel followed by an 'e' would be met with
incomprehension. In both languages, diacritics do not express emphasis
or length: they are used to denote an entirely different letter, with
a pronunciation that differs from that of the unadorned vowel.
Needless to say, evil changes in the meaning of the word would result,
sometimes even from harmless to four-letterish (szár - szar = stem -
sh*t).

> and if
> you need all-seven-bit output Ä can turn to Ae or AE, ß to ss,

These are not strictly letters with diacritical marks, but I suspect
that the original problem is either much simpler (and hasn't been
posted as such) or much more than the elimination of diacritical
marks.

> and so on.
>
> One way to handle this might be an XML file with specific characters in
> an attribute and replacements in content (say), and use key() in XSLT to
> look each character up in turn.
>
> Watch that é can be represented in two ways in Unicode - the precombined
> character or an e followed by an acute accent.

This is true for any letter with a diacritical mark - Normalization
Form Canonical Decomposition (NFD) separates the letter from its
"decorations".
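
The two representations of an accented letter are easy to check directly. The following is only an illustrative sketch in Python, which the thread itself does not mention; it uses the standard unicodedata module to show that the precomposed and the decomposed spelling of the same letter differ as code point sequences, and that NFD and NFC convert between them.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both render the same, but they are different code point sequences.
print(precomposed == decomposed)                 # False

# NFD splits the precomposed letter into base letter + combining mark.
nfd = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(c):04X}" for c in nfd])          # ['U+0065', 'U+0301']

# NFC recombines the pair into the single precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)                        # True
```

Any character-by-character lookup therefore has to settle on one normalization form first, or it will miss one of the two spellings.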
> You can handle the
> multiple-character case in XSLT 2 with replace(), and some XSLT 1
> implementations might have a regular expression library available.
>
> Normalizing the input first into a fully-decomposed form and then using
> a regular expression, or using translate() to identify characters
> outside the ASCII range and skipping them, might be simplest.

Java provides java.text.Normalizer for transforming a text into the
NFD form, which would solve the first part. XQuery 1.0 and XPath 2.0
Functions and Operators has fn:normalize-unicode to achieve this
split.

> I won't speculate further because it really depends on the exact
> software and environment and on your purpose and exact problem.

Indeed.

Wolfgang

> Liam
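
The approach discussed in this message (an optional language-specific lookup, then decomposition, then dropping whatever is not ASCII) might be sketched as follows. This is Python standing in for java.text.Normalizer or fn:normalize-unicode, not code from the thread; the names `to_ascii` and `GERMAN` are hypothetical, and the table is only a sample of the German conventions mentioned above. In XSLT 2.0 the generic step would be roughly `replace(normalize-unicode($s, 'NFD'), '\p{M}', '')`.

```python
import unicodedata

# Hypothetical sample table: German-style replacements that must be applied
# before the generic stripping, because they are not plain accent removal.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue",
          "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def to_ascii(text, table=None):
    """Map text to 7-bit ASCII; `table` holds language-specific overrides."""
    if table:
        # Compose first so that "u" + combining diaeresis matches the "ü" key.
        text = unicodedata.normalize("NFC", text)
        text = "".join(table.get(ch, ch) for ch in text)
    # Decompose, then drop combining marks (nonzero canonical combining class).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    # Anything still outside ASCII has no decomposition and is skipped.
    return stripped.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Müller", GERMAN))   # Mueller
print(to_ascii("szár"))             # szar
```

The second call shows the hazard raised above: without a table, the generic step silently turns one word into another, so whether it is acceptable depends on the language and the purpose of the output.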