Escaping characters

Post by nam » Wed Dec 04, 2013 9:38 pm

I have an application that converts Word documents to DITA, and in the process it scans each paragraph for illegal characters. The original documents include left and right quotes, as well as regular double quotes, angle brackets, etc.

My process converts the standard five:
  • " &quot;
    ' &apos;
    < &lt;
    > &gt;
    & &amp;
For other characters, I convert them to their ASCII code wrapped in "&#" and ";", so the left double quote becomes "&#147;" and the right double quote becomes "&#148;", and so on. But all I get in my output is "#".

Are only those five allowed, or have I misunderstood how to escape the other typeable but illegal characters?
Neil in Washington

Re: Escaping characters

Post by adrian » Wed Dec 04, 2013 9:57 pm

Hi,

Note that you should consider the character encoding. The character codes (#147 and #148) appear to come from Windows-1252 (or a similar code page).
Numeric character references in XML always refer to Unicode code points, regardless of the file's encoding, so for DITA (which usually uses UTF-8) you should use the corresponding Unicode code points (U+201C, U+201D): &#x201c; and &#x201d;
The same applies to other special characters.

Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
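
For readers following along, the advice above could be sketched roughly as follows in Python. This is only an illustration, not the code from the original application: the function name escape_for_xml and the choice to emit references for every non-ASCII character are assumptions, and the sketch assumes the paragraph text has already been decoded to Unicode. It does not deal with characters that XML 1.0 forbids outright, such as most control characters.

```python
# The five characters that have predefined entities in XML.
PREDEFINED = {
    '"': "&quot;",
    "'": "&apos;",
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
}


def escape_for_xml(text: str) -> str:
    """Escape text for XML (hypothetical helper, not from the thread).

    The five predefined characters become named entities; every
    non-ASCII character becomes a hexadecimal numeric character
    reference built from its Unicode code point.
    """
    out = []
    for ch in text:
        if ch in PREDEFINED:
            out.append(PREDEFINED[ch])
        elif ord(ch) > 0x7F:
            # ord() returns the Unicode code point, e.g. 0x201C for
            # the left double quote, never a code-page byte like 147.
            out.append(f"&#x{ord(ch):x};")
        else:
            out.append(ch)
    return "".join(out)


print(escape_for_xml("\u201cTom & Jerry\u201d <draft>"))
# prints: &#x201c;Tom &amp; Jerry&#x201d; &lt;draft&gt;
```

If the output file is written as UTF-8, the non-ASCII characters could also be left as literal characters; the numeric references are just the more defensive option.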

Re: Escaping characters

Post by nam » Wed Dec 04, 2013 11:38 pm

Thank you. I have updated my code to do the correct replacements. Finding the control codes was fun!
Neil in Washington
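
The thread does not show the updated code, so the following is only a guess at the kind of replacement involved, assuming the original codes were Windows-1252 byte values. The helper name cp1252_code_to_ncr is hypothetical. It relies on Python's built-in "cp1252" codec to look up the Unicode code point for a given byte value.

```python
def cp1252_code_to_ncr(code: int) -> str:
    """Map a Windows-1252 byte value to an XML hex character reference.

    Hypothetical helper for illustration. Raises ValueError for values
    outside 0-255 and UnicodeDecodeError for the few byte values that
    Windows-1252 leaves unassigned.
    """
    if not 0 <= code <= 255:
        raise ValueError(f"not a single-byte value: {code}")
    # Decode the one-byte sequence to get the real Unicode character.
    char = bytes([code]).decode("cp1252")
    return f"&#x{ord(char):x};"


print(cp1252_code_to_ncr(147))  # prints: &#x201c;
print(cp1252_code_to_ncr(148))  # prints: &#x201d;
```

A lookup like this avoids maintaining a hand-written table of code-page values and their Unicode equivalents.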
