CHM Output from DITA Map

Posted: Wed Aug 31, 2011 4:13 pm
by StefanG
Hi,

I have created a DITA map with a couple of topics and tried to convert it to CHM. That works so far and the CHM is created fine. However, the TOC in the CHM is corrupted: all German umlauts (ä, ö, ü, ß etc.) are replaced with question marks (?).
I noticed that all umlauts are also converted to question marks in the hhp and hhc files. This is strange, especially because the hhc and hhp files have Unicode encoding.

I have also changed the xsl:output encoding from "UTF-8" to "windows-1252" by editing the "dita2html.xsl" in \frameworks\dita\DITA-OT\xsl\dita2html.xsl but that didn't make a difference.

As HTML Help Workshop is not Unicode-enabled, it doesn't make sense to output UTF-8, right? So I'm wondering why this is the default.

How do I get this working?

Cheers,
*Stefan.

Re: CHM Output from DITA Map

Posted: Wed Aug 31, 2011 4:52 pm
by Radu
Hi Stefan,

The right part of the Windows Help dialog is actually an embedded Internet Explorer, so there is no need for the modification you made in the stylesheets. It understands UTF-8 perfectly.

The TOC, Search and Index parts are compiled from the hhp and hhc files and usually there are problems when displaying characters in them.
We also made some patches to the stylesheets in order to properly show French characters, patches which might not work properly in your case.
Please open the XSLT stylesheets:

OXYGEN_INSTALL_DIR/frameworks/dita/DITA-OT/xsl/map2hhp.xsl

and

OXYGEN_INSTALL_DIR/frameworks/dita/DITA-OT/xsl/map2hhc.xsl

and modify the output encodings back to UTF-8.

In my case, on a Windows 7 machine with an English locale, these changes were enough to correctly display characters in the TOC, Index and Search parts of the dialog.

Regards,
Radu

Re: CHM Output from DITA Map

Posted: Wed Aug 31, 2011 6:52 pm
by StefanG
Hi Radu,

thanks very much for your fast answer. This is so obvious that I did not even consider trying it. It works for most umlauts now, but the underlying codepage mapper does not cover all upper ANSI characters. That is, the following characters from the upper ANSI area still result in question marks:
€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜š›œžŸ
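The pattern described here (umlauts survive, other upper ANSI characters turn into question marks) is what a lossy conversion to iso-8859-1 would produce. The following is only a sketch of that effect, not the toolkit's actual code path. The helper names lossy_roundtrip and unmappable are made up for illustration, and the sample characters are a subset of the list above, written as escapes.

```python
# Illustrative only: the helper names are hypothetical and this is not DITA-OT code.
UMLAUTS = "\u00e4\u00f6\u00fc\u00df"              # a-umlaut, o-umlaut, u-umlaut, sharp s
CP1252_ONLY = "\u20ac\u201e\u2026\u0160\u0153"    # euro sign, low double quote, ellipsis, S-caron, oe ligature


def lossy_roundtrip(text: str, encoding: str) -> str:
    """Encode with '?' substituted for unmappable characters, then decode again."""
    return text.encode(encoding, errors="replace").decode(encoding)


def unmappable(text: str, encoding: str) -> list:
    """Return the characters of text that the given encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Umlauts fit into iso-8859-1, the other sample characters do not.
print(len(unmappable(UMLAUTS, "iso-8859-1")))        # 0
print(len(unmappable(CP1252_ONLY, "iso-8859-1")))    # 5
print(len(unmappable(CP1252_ONLY, "windows-1252")))  # 0
```

If the TOC text passes through an iso-8859-1 step somewhere, only the second group would be replaced, which matches the symptom. Whether that is really what happens here is an assumption.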

However, I'm surprised to see that the topic HTML files seem to be iso-8859-1 now. Where does this come from? I have the following settings now:

dita2html.xsl

<xsl:output method="html" encoding="UTF-8" indent="no"/>

map2hhc.xsl

<xsl:output method="html" encoding="UTF-8" indent="no"/>

map2hhp.xsl

<xsl:output method="text" encoding="UTF-8"/>

However, the html topics get charset=iso-8859-1 and are no longer utf-8:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Where (in which xsl?) does this come from? And how can I force the hhp and hhc to be iso-8859-1 (or preferably something more fitting and more complete, like windows-1252) while the topics remain utf-8?

Thanks again for your support :-)

Cheers,
*Stefan.

Re: CHM Output from DITA Map

Posted: Thu Sep 01, 2011 3:12 pm
by Radu
Hi Stefan,

So in the DITA Open Toolkit ANT build file:

OXYGEN_INSTALL_DIR/frameworks/dita/DITA-OT/build_dita2htmlhelp.xml

there is a target called dita.htmlhelp.convertlang.

This target is responsible for taking the HTML, HHP and HHC files produced by the XSLT stylesheets in UTF-8 encoding and converting them to an encoding considered appropriate for the "xml:lang" value which you have set on the DITA Map root element.

The Java class which performs these conversions (it changes the encoding and also translates some characters to their corresponding entities) is:

org.dita.dost.util.ConvertLang

I looked into it. For de-de it uses "iso-8859-1" for the HTML topic files and "windows-1252" for the HHP and HHC files. It also has a hardcoded map from some characters which do not fit into these encodings to accepted HTML entities.
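As a rough illustration of that post-processing idea (not the actual ConvertLang code), the sketch below re-encodes UTF-8 bytes into the per-language target encoding. Only the de-de row is taken from this thread. The names ENCODINGS and convert, and the "topic"/"toc" keys, are made up for the example. Instead of a hardcoded entity map it uses Python's numeric character references as the fallback, which is a simplification.

```python
# Hypothetical sketch; only the de-de encodings come from the thread.
ENCODINGS = {
    "de-de": {"topic": "iso-8859-1", "toc": "windows-1252"},
}


def convert(utf8_bytes: bytes, lang: str, kind: str) -> bytes:
    """Re-encode UTF-8 input for the given language and file kind.

    kind is "topic" for HTML topic files or "toc" for HHP/HHC files.
    Characters the target encoding cannot hold become numeric character
    references such as &#8364; (a stand-in for a hardcoded entity map).
    """
    text = utf8_bytes.decode("utf-8")
    target = ENCODINGS[lang.lower()][kind]
    return text.encode(target, errors="xmlcharrefreplace")


sample = "Gr\u00f6\u00dfe \u20ac".encode("utf-8")  # "Groesse" with o-umlaut and sharp s, then a euro sign
print(convert(sample, "de-de", "topic"))  # b'Gr\xf6\xdfe &#8364;'
print(convert(sample, "de-de", "toc"))    # b'Gr\xf6\xdfe \x80'
```

The euro sign does not exist in iso-8859-1, so in the topic output it only survives as a reference, while windows-1252 stores it as the single byte 0x80.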

What exactly are the character codes (as hexadecimal or numbers) of the characters which are not properly displayed? The forum post seems to have garbled them. You can see the character codes in Oxygen in the status bar when the caret is before the character.
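If reading the codes off the status bar is tedious, a few lines of script can list them as well. This is just a sketch: describe is a made-up helper, and the three sample characters (euro sign, S with caron, Y with diaeresis) are assumed to be among the problematic ones.

```python
def describe(ch: str):
    """Return the Unicode code point of ch and its windows-1252 byte, if it has one."""
    code_point = f"U+{ord(ch):04X}"
    try:
        byte = f"0x{ch.encode('windows-1252')[0]:02X}"
    except UnicodeEncodeError:
        byte = None  # character has no windows-1252 representation
    return code_point, byte


# Samples written as escapes so the script itself stays pure ASCII.
for ch in "\u20ac\u0160\u0178":
    print(*describe(ch))
# U+20AC 0x80
# U+0160 0x8A
# U+0178 0x9F
```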

So, apart from reverting the changes Oxygen had made in the two stylesheets, you should not modify the stylesheets at all: the post-processing step in the ANT build file expects all of the outputs (HHP, HHC and HTML) to be UTF-8 encoded.

Regards,
Radu