HTML5 and foreign characters

Ericounet
Posts: 8
Joined: Thu Dec 09, 2010 12:32 pm

Post by Ericounet »

Hi,

I am trying to generate HTML5 from bilingual documents. Every part has the correct language attribute, like this:

Code: Select all


...
<chapter xml:lang="ru">
<title><foreignphrase xml:lang="en">Russian text</foreignphrase></title>
<subtitle>РАЗВЕДКА ДОНЕСЕНИЕ ИНФОРМАЦИИ</subtitle>
...
I get this kind of output:

Code: Select all


-----
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title xmlns:ng="http://docbook.org/docbook-ng">&#1043;&#1083;&#1072;&#1074;&#1072; 1. Russian text</title>
-----
How can I keep the UTF-8 characters as they were in the original file (without their being converted to entities)?
I have probably missed something: a parameter or something else.

Thanks for your answer.

Eric
Radu
Posts: 9018
Joined: Fri Jul 09, 2004 5:18 pm

Re: HTML5 and foreign characters

Post by Radu »

Hi Eric,

You seem to be using some kind of Docbook vocabulary. Do you use custom XSLT stylesheets for publishing it to HTML? If so, you should look in your XSLT stylesheets for something like:

Code: Select all

<xsl:output encoding="..."/>
and make sure that the encoding is set to UTF-8. If the encoding is more restrictive (for example ASCII or ISO 8859-1) and cannot represent a specific character, the XSLT processor automatically escapes every character that cannot be written directly in that encoding.
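
To illustrate the escaping described above, here is a minimal Python sketch. It is not the XSLT processor's own code; it only assumes that a serializer falls back to numeric character references when the chosen output encoding cannot hold a character. The helper name `serialize` is made up for this example.

```python
# Sketch: what a serializer has to do when the output encoding
# cannot represent a character. `serialize` is a hypothetical helper.
text = "Глава 1. Russian text"

def serialize(s: str, encoding: str) -> bytes:
    # "xmlcharrefreplace" writes &#NNNN; for every character
    # that the target codec cannot encode directly.
    return s.encode(encoding, errors="xmlcharrefreplace")

latin1 = serialize(text, "iso-8859-1")  # Cyrillic does not fit in ISO 8859-1
utf8 = serialize(text, "utf-8")         # UTF-8 can hold every character

print(latin1.decode("ascii"))
print(utf8.decode("utf-8"))
```

With ISO 8859-1 the Cyrillic letters come out as `&#1043;&#1083;&#1072;&#1074;&#1072;`, the same references seen in the generated `<title>`, while with UTF-8 the text is written unchanged.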

Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com

Post by Ericounet »

Hi,

Thanks for the answer.

I installed a fresh copy of oXygen 19 and ran a test with the same input files, without any customization file. Here is what I got:

* html, html-chunk: got entities and no UTF-8 chars

* xhtml, xhtml-chunk: output is fine ... but it's xhtml and not html5

I found that ISO-8859-1 was set in a lot of files (output method); for java-help and web-help, it's OK, because UTF-8 is not well handled, but why in so many other files?

Is there a parameter I could adjust to avoid modifying all the files?

Best regards

Eric

Post by Radu »

Hi Eric,

I have no idea why the Docbook XSLT stylesheets use this encoding by default when generating HTML documents (but use UTF-8 for XHTML). I added an issue for this on the Docbook XSLs project:

https://sourceforge.net/p/docbook/bugs/1400/

Editing "OXYGEN_INSTALL_DIR\frameworks\docbook\xsl\html\profile-docbook.xsl" directly and setting UTF-8 as the encoding should be enough if you are not chunking the output.
When chunking the output, there seems to be a special parameter called chunker.output.encoding, which you can set to "UTF-8" in the transformation scenario.

Regards,
Radu

Post by Ericounet »

Hi Radu,

Thanks, I'll try it.

I checked the "original" stylesheets and the parameters are the same as in the oXygen distribution (ISO-8859-1).

Best regards.

Eric

Post by Ericounet »

Hi again, Radu,

I tested your solutions and they don't work ....

:(

Thanks for the time you spent on this.

Best regards

Eric

PS: In any case, I can still use pandoc.

Post by Radu »

Hi Eric,

Today I went back to look more into this.
So for Docbook to HTML, XHTML and so on, you can use either the Saxon 6.5.5 processor or the Xalan processor. If you edit the transformation scenario, there is a "Transformer" combo box where you can choose the processor. From what I tested, the Xalan processor should not have this problem of outputting escaped characters when using method="html", but the Saxon 6.5.5 processor does indeed do this by default.
For example if I create a simple XSLT stylesheet which just copies the XML content:

Code: Select all

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
and apply it over the XML using Saxon 6, it will escape all non-ASCII characters.
I found something about this in the Saxon 6 change list:

http://saxon.sourceforge.net/saxon6.5.3/changes.html
saxon:character-representation gives the preferred representation for special characters. For method="xml" the values are "hex" or "decimal" controlling whether character references should be in decimal or hexadecimal notation: this applies only to characters outside the selected encoding. For method="html" two values may be given, separated by a semicolon. The first gives the representation for non-ASCII characters that are present in the target character set: the values are "native", "entity", "decimal", or "hex". The second gives the representation for characters outside the selected encoding: the same values can be used, except for "native". For example if encoding="iso-8859-1", then saxon:character-representation="native;hex" causes characters in the range 0-255 to be written as themselves (except less-than, ampersand, etc which are always written as entity references, as is non-breaking-space), and causes characters outside this range to be written as hexadecimal character references. By contrast "entity;decimal" causes characters in the range 160-255 to be written using HTML-defined symbolic entities, and characters above 255 to be written in decimal. The default is "entity;decimal".
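
The rule quoted above can be mimicked in a few lines of Python. This is only a sketch of the behaviour as the change list describes it, not Saxon's implementation. The function names are invented for the example, and two details are assumptions: "entity" falls back to a decimal reference when HTML defines no named entity for the character, and ">" is treated as one of the always-escaped characters.

```python
from html.entities import codepoint2name

# Characters the quoted text says are always written as entity references
# (">" is an assumption covered by the "etc" in the quote).
ALWAYS_ESCAPED = {"<": "&lt;", ">": "&gt;", "&": "&amp;", "\u00a0": "&nbsp;"}

def can_encode(ch: str, encoding: str) -> bool:
    """True if the character exists in the target character set."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def represent(ch: str, mode: str) -> str:
    """Write one non-ASCII character as native, entity, hex or decimal."""
    cp = ord(ch)
    if mode == "native":
        return ch
    if mode == "entity" and cp in codepoint2name:
        return f"&{codepoint2name[cp]};"
    if mode == "hex":
        return f"&#x{cp:X};"
    # "decimal", and the assumed fallback for "entity" without a named entity
    return f"&#{cp};"

def serialize_html(text: str, encoding: str,
                   representation: str = "entity;decimal") -> str:
    # First value: characters inside the encoding; second: characters outside it.
    inside, _, outside = representation.partition(";")
    outside = outside or "decimal"
    out = []
    for ch in text:
        if ch in ALWAYS_ESCAPED:
            out.append(ALWAYS_ESCAPED[ch])
        elif ord(ch) < 128:
            out.append(ch)
        elif can_encode(ch, encoding):
            out.append(represent(ch, inside))
        else:
            out.append(represent(ch, outside))
    return "".join(out)

print(serialize_html("é Г", "iso-8859-1", "native;hex"))
print(serialize_html("Глава", "utf-8"))
print(serialize_html("Глава", "utf-8", "native"))
```

Under these assumptions, the default "entity;decimal" still turns Cyrillic text into decimal references even when the encoding is UTF-8, because the first value governs characters that are inside the encoding; only "native" writes them as themselves.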
So when this attribute is not set explicitly, the default is "entity;decimal", meaning that all non-ASCII characters are escaped when using method="html". But the attribute can be set directly in the XSLT stylesheet, like:

Code: Select all

<xsl:output method="html" saxon:character-representation="native" xmlns:saxon="http://saxon.sf.net/"/>
Using that in my simple copy stylesheet seemed to work.
But after I opened OXYGEN_INSTALL_DIR\frameworks\docbook\xsl\html\profile-docbook.xsl and replaced its xsl:output with:

Code: Select all

<xsl:output method="html" encoding="UTF-8" indent="no" saxon:character-representation="native" xmlns:saxon="http://saxon.sf.net/"/>
producing a single (not chunked) HTML file from the Docbook file still shows the same problem, and I do not know why.

Regards,
Radu

Post by Ericounet »

Hi Radu,

Thanks, Xalan works perfectly :)

Best regards

Eric