Re: [xsl] special character encoding, two problems


Subject: Re: [xsl] special character encoding, two problems
From: "Wolfgang Laun wolfgang.laun@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 24 Oct 2014 19:24:09 -0000

There are typographical ligatures (ﬀ, ﬄ, ﬁ, ...) which exist to
improve the visual appearance of printed type. These can be replaced by
their orthographically equivalent letter sequences ("ff", "ffl", "fi", ...).
This is not true for ligatures where letters have morphed into new letters.

The "dotless i" is a letter in its own right, at least in Turkish, along
with its dotted variant.
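
One way to see which characters fall into the first group is to ask for their Unicode compatibility decomposition. The following is a minimal XSLT 2.0 sketch, assuming a processor that implements the NFKD form of normalize-unicode() (only NFC is mandatory); the template name "main" is just an illustrative entry point.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Can be run against any source document, or with "main" as the
       initial template. -->
  <xsl:template match="/" name="main">
    <!-- U+FB00, U+FB04, U+FB01: presentation-form ligatures.
         Each has a compatibility decomposition: ff, ffl, fi. -->
    <xsl:value-of select="normalize-unicode('&#xFB00;', 'NFKD')"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="normalize-unicode('&#xFB04;', 'NFKD')"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="normalize-unicode('&#xFB01;', 'NFKD')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- U+00E6 (ae) and U+0131 (dotless i) have no decomposition,
         so they are returned unchanged. -->
    <xsl:value-of select="normalize-unicode('&#xE6;&#x131;', 'NFKD')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The first three lines of output are plain "ff", "ffl" and "fi"; the last line still contains æ and ı, because Unicode treats them as letters rather than as typographical variants.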

-W


On 24 October 2014 19:11, Jonina Dames jdames@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>  Thanks Michael,
>
> A couple of follow-up questions:
>
> Why does it matter whether or not ligatures are used in English? I note
> that ligatures like &fflig; correctly convert to "ff", so why not other
> ligatures?
>
> Understood about xf8, but why not x131 (i with no dot), or any of the
> other characters in my list?
>
> Thanks,
> Joni
>
>
>
> On 10/24/14 12:54 PM, Michael Kay mike@xxxxxxxxxxxx wrote:
>
>
> What I'm unclear on is why the function is correctly converting "&#x00E9;"
> to "e", but not "&#xf8;" to "o".
>
>
>  Because Unicode normalization into decomposed form does not split xf8
> into an "o" and a "/" modifier. Don't ask me why, probably there were some
> voluble and well educated Danes and Norwegians on the committee who
> insisted that xf8 was not a modified "o".
>
>  Some of the characters below are ligatures, e.g. Æ and æ and Œ, some
> (like thorn) are first-class letters in their own right that just happen
> not to be used in English.
>
>  If you only need to transliterate these characters, and not the whole of
> Cyrillic, Greek, Hebrew, etc, then I think you would be best off just
> enumerating them.
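
Enumerating the characters could look roughly like the sketch below: a small lookup table handles the letters that normalization leaves alone, and the expression from earlier in the thread handles everything that does decompose. The function name f:to-ascii, the namespace URI, the variable name and the chosen ASCII spellings are all assumptions made for illustration, and the table is deliberately partial.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:ascii-fold"
    exclude-result-prefixes="xs f">

  <xsl:output method="text"/>

  <!-- Partial, illustrative table. The ASCII spellings are assumptions;
       extend and adjust it to cover the remaining characters. -->
  <xsl:variable name="table" as="element(m)+">
    <m c="&#xC6;" a="AE"/>
    <m c="&#xE6;" a="ae"/>
    <m c="&#xD8;" a="O"/>
    <m c="&#xF8;" a="o"/>
    <m c="&#xDE;" a="Th"/>
    <m c="&#xFE;" a="th"/>
    <m c="&#xDF;" a="ss"/>
    <m c="&#x141;" a="L"/>
    <m c="&#x142;" a="l"/>
    <m c="&#x152;" a="OE"/>
    <m c="&#x153;" a="oe"/>
    <m c="&#x131;" a="i"/>
  </xsl:variable>

  <xsl:function name="f:to-ascii" as="xs:string">
    <xsl:param name="s" as="xs:string?"/>
    <!-- Step 1: swap each enumerated character for its replacement
         string; every other character passes through untouched. -->
    <xsl:variable name="mapped" as="xs:string" select="
      string-join(
        for $cp in string-to-codepoints($s)
        return
          for $ch in codepoints-to-string($cp)
          return ($table[@c = $ch]/@a, $ch)[1],
        '')"/>
    <!-- Step 2: decompose, drop combining marks, recompose. -->
    <xsl:sequence select="
      normalize-unicode(
        replace(normalize-unicode($mapped, 'NFKD'), '\p{Mn}', ''),
        'NFKC')"/>
  </xsl:function>

  <!-- Usage example: outputs "Jorgen Thor". -->
  <xsl:template match="/" name="main">
    <xsl:value-of select="f:to-ascii('J&#xF8;rgen &#xDE;&#xF3;r')"/>
  </xsl:template>

</xsl:stylesheet>
```

Unlike translate(), the table can map one character to several, which is what the ligatures and thorn need. Note that folding ı to "i" loses a distinction that matters in Turkish.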
>
>  Michael Kay
> Saxonica
>
>
>  Is there a way to make this function convert all accented Latin letters
> to plain ASCII characters? We really need coverage for any letter that can
> appear in a European name, so this should also convert the numeric
> character reference for thorn (þ, &#xfe;) to one or more plain ASCII
> characters, to cover authors from Iceland.
>
> I ran a broad test of all the accented Latin letters most likely to occur
> in author names, and these 28 characters are the only ones that were not
> converted to plain ASCII equivalents:
>
> &#xc6;    Æ
> &#xd0;    Ð
> &#xd8;    Ø
> &#xde;    Þ
> &#xdf;    ß
> &#xe6;    æ
> &#xf0;    ð
> &#xf8;    ø
> &#xfe;    þ
> &#x110;    Đ
> &#x111;    đ
> &#x126;    Ħ
> &#x127;    ħ
> &#x131;    ı
> &#x141;    Ł
> &#x142;    ł
> &#x14a;    Ŋ
> &#x14b;    ŋ
> &#x152;    Œ
> &#x153;    œ
> &#x166;    Ŧ
> &#x167;    ŧ
> &#x180;    ƀ
> &#x197;    Ɨ
> &#x1b5;    Ƶ
> &#x1b6;    ƶ
> &#x1e4;    Ǥ
> &#x1e5;    ǥ
>
> Is there a different set of flags for this function that will yield the
> result I'm looking for? If this function cannot do that, what is the best
> way to convert all of these outlying characters? I need this conversion to
> happen in only one element of my XML, not the entire XML document. I can't
> use translate because it's a one-to-one conversion that doesn't cover the
> ligatures listed above. If normalize-unicode cannot be made to cover all
> the characters listed above, can character-maps be applied that act
> specifically on only one element?
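
On the scoping part of the question: character maps are applied by the serializer to the whole result, so they cannot be limited to one element. Limiting a conversion to one element is usually done with template matching instead. The sketch below assumes XSLT 2.0 and uses a hypothetical element name, surname; the expression inside the template is the one from earlier in the thread and can be swapped for whatever conversion is finally chosen.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only text directly inside the (hypothetical) surname element is
       converted: decompose, then strip the combining marks. -->
  <xsl:template match="surname/text()">
    <xsl:value-of select="replace(normalize-unicode(., 'NFKD'), '\p{Mn}', '')"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the second template has a more specific match pattern, it takes precedence over the identity template for those text nodes only; the rest of the document is copied as-is.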
>
> Thanks,
> Joni
>
>
>
> On 10/24/14 9:11 AM, Eliot Kimber ekimber@xxxxxxxxxxxx wrote:
>
> I can't restrain my own pedantry: the correct term is "numeric character
> reference", not "numeric entity": http://www.w3.org/TR/REC-xml/#dt-charref
>
> Given that I think I'm the only person who ever uses the term correctly
> and consistently, we probably should have just used "numeric entity" but
> so it goes.
>
> Cheers,
>
> E.
> -----
> Eliot Kimber, Owner
> Contrext, LLC
> http://contrext.com
>
>
>
>
> On 10/23/14, 4:13 PM, "Graydon graydon@xxxxxxxxx"
> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>
>  On Thu, Oct 23, 2014 at 08:39:11PM -0000, Jonina Dames jdames@xxxxxxxxx
> scripsit:
>
>  Thanks for the advice! The
> <xsl:value-of select="normalize-unicode(replace(normalize-unicode(., 'NFKD'), '\p{Mn}', ''), 'NFKC')"/>
> function works for most of the entities, but it's still missing a
> couple dozen characters.
>
>  Terminology pedant time --
>
> &#x00e9; is a numeric entity and exactly the same thing as é just
> written differently.
>
> &eacute; is a named entity reference (which had better be defined
> somewhere)
>
> Either, as soon as the XML document is parsed, turns into U+00E9 in some
> internal representation and they're not different from each other or the
> representation for é if someone had typed that directly in the utf-8
> input file.
>
> So when you say "entity" here I'm getting the nervous feeling that I
> don't know what you mean; can you provide some examples?
>
>
>  Some of the author names still have unicode entities instead of plain
> ascii (for example, several characters with a stroke, several
> ligatures, thorn characters, upper and lowercase). Is there a
>
>  Well, examples would be good, but thorn, for example, &#x00FE; which is
> the self-same code point as þ, doesn't involve a modifier; it's one
> whole letter that doesn't exist inside ASCII.
>
> Stripping the modifiers -- which will give you e from é if you decompose
> é first, because then it's e + ˊ, which you could write &#x0065; +
> &#x0301; and it would be the same -- doesn't do anything because there
> is no modifier there, it's just the single code-point for thorn.
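
The difference described here can be made visible by listing code points after decomposition. This is a small sketch, assuming an XSLT 2.0 processor that supports the NFD form; the template name "main" is only an illustrative entry point.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/" name="main">
    <!-- U+00E9 decomposes into e (101) plus combining acute (769),
         so this line prints: 101 769 -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xE9;', 'NFD'))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- U+00FE (thorn) has no decomposition, so there is no mark to
         strip and this line prints: 254 -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xFE;', 'NFD'))"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Removing \p{Mn} from the first result leaves the single code point 101, which is "e"; the second result contains nothing in \p{Mn}, so the replace() call leaves it untouched.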
>
>
>  variation of this function or a parameter that will catch and convert
> ALL of these to plain ascii, as well as the standard acute and cedil
> characters? Or do I need to address these outlying characters with
> something else (not translate, since I can't use a one-to-one
> replacement for ligature entities)?
>
>  ASCII, strictly, is seven-bit; there are lots of things you can't
> represent in ASCII.  &#x00e9; *is not* ASCII just because those eight
> characters all happen to be ASCII characters.
>
> So it sounds like you're trying to (either) map U+00FE, þ, to &thorn; or
> something like that (which is not, I cannot stress too much, ASCII; it
> might be an ASCII representation of a non-ASCII code-point, but it's
> still a non-ASCII code-point) or have þ decompose into t+h or something
> of that ilk.  (Which is at least actually ASCII.)
>
> Either way you'd have to use character mappings for those; there aren't
> any modifiers to remove.
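
For reference, a character mapping in XSLT 2.0 is declared with xsl:character-map and switched on from xsl:output. The sketch below is illustrative only: the map name fold-outliers and the replacement spellings are assumptions, and only three of the characters are listed.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The map is applied by the serializer to the whole output. -->
  <xsl:output method="xml" encoding="US-ASCII"
              use-character-maps="fold-outliers"/>

  <xsl:character-map name="fold-outliers">
    <xsl:output-character character="&#xFE;" string="th"/>
    <xsl:output-character character="&#xDE;" string="Th"/>
    <xsl:output-character character="&#xE6;" string="ae"/>
  </xsl:character-map>

  <!-- Identity template: the transformation itself changes nothing. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Two caveats apply. The mapping takes effect only when the processor serializes the result, and it covers every text node and attribute value in the output rather than a chosen element. With encoding="US-ASCII", any unmapped non-ASCII character is written as a numeric character reference.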
>
> Are you really compelled to deliver seven bit ASCII?
>
> And, please, some examples.
>
> -- Graydon
>
>
>
>
>
> --
> Jonina Dames
> Customer Support Specialist
> Inera Inc.
> +1 617 932 1932
> eXtyles on Twitter <https://twitter.com/extyles>
> jdames@xxxxxxxxx
>
>    XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>

