Escaping characters

Posted: Wed Dec 04, 2013 9:38 pm
by nam
I have an application that converts Word documents to DITA, and in the process it scans each paragraph for illegal characters. The original documents include left and right quotes, as well as regular double quotes, angle brackets, etc.

My process converts the standard five:
  • " &quot;
    ' &apos;
    < &lt;
    > &gt;
    & &amp;
For other characters, I convert them to their ASCII code, wrapped in the "&#" and ";" characters, so left double quote becomes "&#147;" and right double quote becomes "&#148;", etc. But all I get in my output is "#".

Are only the previous five allowed, or did I misunderstand how to escape the other typeable but illegal characters?

Re: Escaping characters

Posted: Wed Dec 04, 2013 9:57 pm
by adrian
Hi,

Note that you should consider the character encoding. The character codes (#147 and #148) seem to come from Windows-1252 (or a similar code page). A numeric character reference in XML always names a Unicode code point, and in Unicode 147 and 148 are control characters, not quotes.
Since you're working with DITA, which usually uses the UTF-8 encoding, you should use the corresponding Unicode code points (U+201C, U+201D): &#x201c; and &#x201d;.
It goes without saying that the same applies for other special characters.
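
As a rough illustration of the above, here is a minimal Python sketch. The function names are hypothetical and not taken from the original application, and it assumes the legacy codes (147, 148, ...) are Windows-1252 byte values. It escapes the five predefined entities, writes every other non-ASCII character as a hexadecimal reference to its Unicode code point, and shows how a legacy code could be mapped to the matching reference.

```python
# Hypothetical helpers for illustration only; the names are not from the
# original application.
PREDEFINED = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&apos;",
}


def escape_for_xml(text: str) -> str:
    """Escape the five predefined entities and write every non-ASCII
    character as a hexadecimal numeric character reference."""
    out = []
    for ch in text:
        if ch in PREDEFINED:
            out.append(PREDEFINED[ch])
        elif ord(ch) > 127:
            # ord() gives the Unicode code point, which is what &#x...; names.
            out.append(f"&#x{ord(ch):x};")
        else:
            out.append(ch)
    return "".join(out)


def cp1252_code_to_ref(code: int) -> str:
    """Map a legacy byte value (assumed to be Windows-1252) to a reference
    to the corresponding Unicode code point.

    Raises UnicodeDecodeError for the few byte values that Python's
    cp1252 codec leaves undefined (e.g. 0x81).
    """
    ch = bytes([code]).decode("cp1252")
    return f"&#x{ord(ch):x};"


print(escape_for_xml("\u201cA & B\u201d"))  # curly quotes around "A & B"
print(cp1252_code_to_ref(147))
print(cp1252_code_to_ref(148))
```

If the output file is saved as UTF-8, the non-ASCII characters could also be written literally instead of as references; escaping them is a choice here, not a requirement.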

Regards,
Adrian

Re: Escaping characters

Posted: Wed Dec 04, 2013 11:38 pm
by nam
Thank you. I have updated my code to do the correct replacements. Finding the control codes was fun!