Convert Diacritical Unicode Character and Punctuation Codes

jdrouin
Posts: 5
Joined: Mon Aug 24, 2009 6:45 pm

Convert Diacritical Unicode Character and Punctuation Codes

Post by jdrouin » Sat Nov 27, 2010 7:28 am

I have a TEI Tite XML file containing a French text with thousands of diacritical characters. Though the document header declares it is encoded in UTF-8...

<?xml version="1.0" encoding="utf-8"?>

... the codes for diacritical characters and punctuation are in a different format, i.e.:

s&#x2019;est &#x00E9;coul&#x00E9; jusqu&#x2019;&#x00E0; son r&#x00E9;veil; mais leurs rangs peuvent se m&#x00EA;ler

I am trying to use the PhiloLogic text mining tool to analyze this text, but it won't find diacritical characters or punctuation unless they're in UTF-8.

How do I convert only the diacritical character and punctuation codes, as above, to UTF-8 codes?

Thanks,

Jeff

jdrouin
Posts: 5
Joined: Mon Aug 24, 2009 6:45 pm

Re: Convert Diacritical Unicode Character and Punctuation Codes

Post by jdrouin » Sat Nov 27, 2010 7:31 am

Sorry, forgot to say I've been trying to use oXygen to change those codes but just can't seem to find anything about it in the help system. Hoping someone here can point me in the right direction.

Best,

Jeff

Radu
Posts: 7028
Joined: Fri Jul 09, 2004 5:18 pm

Re: Convert Diacritical Unicode Character and Punctuation Codes

Post by Radu » Mon Nov 29, 2010 10:10 am

Hi Jeff,

First of all, characters escaped as numeric character references (such as &#x00E9;) are perfectly legal in XML documents, and a tool that does not handle them properly is not fully XML conformant.
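
To illustrate that point, here is a minimal sketch using Python's standard xml.etree.ElementTree parser (the sample strings and variable names are assumptions made up for this example, not taken from the TEI file). It suggests that a conformant parser reports the same text whether a character is escaped or written literally:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same fragment, once with numeric character
# references and once with the literal characters.
escaped = "<p>s&#x2019;est &#x00E9;coul&#x00E9;</p>"
literal = "<p>s\u2019est \u00e9coul\u00e9</p>"

# A conformant parser resolves the references while parsing.
escaped_text = ET.fromstring(escaped).text
literal_text = ET.fromstring(literal).text

# Both forms yield the same character data.
print(escaped_text == literal_text)
```

So the difficulty most likely arises when a tool reads the file as raw text instead of running it through an XML parser.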

In Oxygen, select the entire content of the XML file, right-click, and choose Source->Unescape selection from the contextual menu. Uncheck all the checkboxes, then check only the "Unescape Characters" checkbox. This should do what you want to accomplish.
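
For anyone who would rather script the same conversion outside the editor, the following is a rough sketch in Python, not a description of what Oxygen does internally. The function name unescape_char_refs and the sample string are assumptions for illustration. It replaces numeric character references with the literal characters, but leaves references to markup-significant characters (<, >, &, quotes) escaped so the document stays well-formed. It works on raw text, so it would also rewrite references that appear inside CDATA sections or comments.

```python
import re

# Code points that must stay escaped to keep the XML well-formed.
KEEP_ESCAPED = {ord(c) for c in "<>&\"'"}

# Matches hexadecimal (&#x00E9;) and decimal (&#233;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def unescape_char_refs(text: str) -> str:
    """Replace numeric character references with literal characters."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave markup characters, surrogates, and out-of-range values as-is.
        if (
            code_point in KEEP_ESCAPED
            or 0xD800 <= code_point <= 0xDFFF
            or code_point > 0x10FFFF
        ):
            return match.group(0)
        return chr(code_point)

    return CHAR_REF.sub(replace, text)


sample = "s&#x2019;est &#x00E9;coul&#x00E9; jusqu&#x2019;&#x00E0; son r&#x00E9;veil &#x26; &amp;"
print(unescape_char_refs(sample))
```

To apply it to a whole file, read the file with open(..., encoding="utf-8"), pass the contents through the function, and write the result back out with the same encoding so that it matches the XML declaration.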

Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com
