Convert Diacritical Unicode Character and Punctuation Codes

Posted: Sat Nov 27, 2010 7:28 am
by jdrouin
I have a TEI Tite XML file containing a French text with thousands of diacritical characters. Though the document header declares it is encoded in UTF-8...

Code:

<?xml version="1.0" encoding="utf-8"?>
... the diacritical characters and punctuation are written as hexadecimal numeric character references rather than as literal characters, e.g.:

Code:

s&#x2019;est &#x00E9;coul&#x00E9; jusqu&#x2019;&#x00E0; son r&#x00E9;veil; mais leurs rangs peuvent se m&#x00EA;ler
I am trying to use the PhiloLogic text mining tool to analyze this text, but it won't find diacritical characters or punctuation unless they're in UTF-8.

How do I convert only the diacritical character and punctuation codes, as above, to literal UTF-8 characters?

Thanks,

Jeff

Re: Convert Diacritical Unicode Character and Punctuation Codes

Posted: Sat Nov 27, 2010 7:31 am
by jdrouin
Sorry, forgot to say I've been trying to use oXygen to change those codes but just can't seem to find anything about it in the help system. Hoping someone here can point me in the right direction.

Best,

Jeff

Re: Convert Diacritical Unicode Character and Punctuation Codes

Posted: Mon Nov 29, 2010 10:10 am
by Radu
Hi Jeff,

First of all, escaping characters as numeric character references is perfectly legal in XML documents; a tool that does not handle them properly is not fully XML conformant.
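
To illustrate the point, here is a small Python sketch using the standard library's xml.etree.ElementTree parser (chosen only for the demonstration; it is not related to Oxygen or PhiloLogic). It parses the same phrase once with escaped characters and once with literal ones; a conformant parser should hand back identical text for both. The sample element name `p` is just an assumption for the example.

```python
import xml.etree.ElementTree as ET

# The same phrase, once with numeric character references and once with literal characters.
escaped = "<p>s&#x2019;est &#x00E9;coul&#x00E9;</p>"
literal = "<p>s\u2019est \u00e9coul\u00e9</p>"

# A conformant XML parser resolves the references while parsing.
text_from_escaped = ET.fromstring(escaped).text
text_from_literal = ET.fromstring(literal).text

same = text_from_escaped == text_from_literal
print(same)
```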

You can select the entire content of the XML file in Oxygen, right-click, and choose Source->Unescape selection from the contextual menu. Uncheck all the checkboxes, then check only the "Unescape Characters" checkbox. This should accomplish what you want.
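
If you ever need to do the same conversion outside Oxygen, for example in a batch script, the following Python sketch shows one possible approach. It is not how Oxygen implements the action, and the function name `unescape_numeric_refs` is made up for this example. It assumes the file has no CDATA sections or comments containing text that merely looks like a reference, and it deliberately leaves references to &, <, >, " and ' alone so the XML stays well-formed.

```python
import re

# Characters that must stay escaped to keep the XML well-formed.
KEEP_ESCAPED = {"&", "<", ">", '"', "'"}

# Matches hexadecimal (&#x00E9;) and decimal (&#233;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def unescape_numeric_refs(xml_text):
    """Replace numeric character references with the literal characters."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            char = chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the reference untouched.
            return match.group(0)
        # Keep markup-significant characters escaped.
        return match.group(0) if char in KEEP_ESCAPED else char

    return NUMERIC_REF.sub(replace, xml_text)


sample = "s&#x2019;est &#x00E9;coul&#x00E9; jusqu&#x2019;&#x00E0; son r&#x00E9;veil"
converted = unescape_numeric_refs(sample)
print(converted)
```

To apply it to a whole document, read the file with open(path, encoding="utf-8"), pass the contents through the function, and write the result back with the same encoding, so that it still matches the encoding="utf-8" declaration in the XML header.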

Regards,
Radu