HTML5 and foreign characters
-
- Posts: 8
- Joined: Thu Dec 09, 2010 12:32 pm
HTML5 and foreign characters
Hi,
I am trying to generate HTML5 from bilingual documents. Every part has the right language attribute, like this:
Code: Select all
...
<chapter xml:lang="ru">
<title><foreignphrase xml:lang="en">Russian text</foreignphrase></title>
<subtitle>РАЗВЕДКА ДОНЕСЕНИЕ ИНФОРМАЦИИ</subtitle>
...
I get this kind of output:
Code: Select all
-----
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title xmlns:ng="http://docbook.org/docbook-ng">Глава 1. Russian text</title>
-----
How can I keep the UTF-8 characters as they were in the original file, without having them converted to entities?
I probably missed something, a parameter or something else.
Thanks for your answer.
Eric
-
- Posts: 9431
- Joined: Fri Jul 09, 2004 5:18 pm
Re: HTML5 and foreign characters
Hi Eric,
You seem to be using some kind of DocBook vocabulary. Do you use custom XSLT stylesheets for publishing it to HTML? If so, you should look in your XSLT stylesheets for something like:
Code: Select all
<xsl:output encoding="..."/>
and make sure that the encoding is set to UTF-8. If the encoding is more restrictive (for example ASCII or ISO 8859-1) and cannot represent a specific character, the XSLT processor automatically escapes every character that cannot be written directly in that encoding.
Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com
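As a rough illustration of the advice above, here is a minimal sketch of a customization layer that sets the output encoding without editing the stock stylesheets. The import URL is the public DocBook XSL location and is only an assumption about the setup; a local path to docbook.xsl would work the same way. It relies on the XSLT 1.0 rule that xsl:output attributes in the importing stylesheet take precedence over those in imported ones. It only addresses the declared encoding, so a processor-specific character representation setting could still escape characters.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical customization layer: the import location is an assumption,
     point it at the docbook.xsl that your transformation actually uses. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Pull in the stock single-file HTML stylesheet unchanged. -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl"/>

  <!-- Declared in the importing stylesheet, so these attributes override
       the encoding declared by the imported stylesheet. -->
  <xsl:output method="html" encoding="UTF-8" indent="no"/>

</xsl:stylesheet>
```

The transformation scenario would then point at this file instead of the stock stylesheet.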
-
- Posts: 8
- Joined: Thu Dec 09, 2010 12:32 pm
Re: HTML5 and foreign characters
Hi,
Thanks for the answer.
I installed a fresh copy of oXygen 19 and ran a test with the same input files, without any customization file. Here is what I got:
* html, html-chunk: got entities and no UTF-8 chars
* xhtml, xhtml-chunk: output is fine ... but it's XHTML and not HTML5
I found that ISO-8859-1 is set in a lot of files (in the output method). For java-help and web-help that's OK, because UTF-8 is not well handled there, but why in so many other files?
Is there a parameter I could adjust to avoid modifying all the files?
Best regards
Eric
-
- Posts: 9431
- Joined: Fri Jul 09, 2004 5:18 pm
Re: HTML5 and foreign characters
Hi Eric,
I have no idea why the DocBook XSLT stylesheets use this encoding by default when generating HTML documents (but use UTF-8 for XHTML). I added an issue for this on the DocBook XSLs project:
https://sourceforge.net/p/docbook/bugs/1400/
Changing "OXYGEN_INSTALL_DIR\frameworks\docbook\xsl\html\profile-docbook.xsl" directly and setting UTF-8 as the encoding should be enough if you are not chunking the output.
When chunking the output, there seems to be a special parameter called chunker.output.encoding, which you can set to "UTF-8" in the transformation scenario.
Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com
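For the chunked case, the parameter mentioned above could also be set in a small customization layer rather than in the transformation scenario. This is only a sketch: the import URL is an assumption about where chunk.xsl lives. The saxon.character.representation parameter is my recollection of the DocBook XSL parameter reference, not something tested in this thread, so check it against the parameter documentation of the installed release.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical customization layer for chunked HTML output.
     The import location is an assumption; adjust it to your install. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/chunk.xsl"/>

  <!-- Encoding used for the files written by the chunking stylesheet. -->
  <xsl:param name="chunker.output.encoding">UTF-8</xsl:param>

  <!-- Assumption: parameter name and value taken from the DocBook XSL
       parameter reference. It asks Saxon to write characters natively when
       the encoding can represent them, and as decimal references otherwise. -->
  <xsl:param name="saxon.character.representation">native;decimal</xsl:param>

</xsl:stylesheet>
```

Both values can equally be passed as parameters in the transformation scenario, which avoids the extra file.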
-
- Posts: 9431
- Joined: Fri Jul 09, 2004 5:18 pm
Re: HTML5 and foreign characters
Hi Eric,
Today I went back to look more into this.
With DocBook to HTML, XHTML and so on, you can use either the Saxon 6.5.5 processor or the Xalan processor. If you edit the transformation scenario, there is a "Transformer" combo box where you can choose your processor. From what I tested, the Xalan processor should not have this problem of outputting escaped characters when using method="html". But the Saxon 6.5.5 processor does indeed do this by default.
For example, if I create a simple XSLT stylesheet which just copies the XML content:
Code: Select all
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
and apply it over the XML using Saxon 6, it will escape all non-ASCII characters.
I found something about this in the Saxon 6 change list:
http://saxon.sourceforge.net/saxon6.5.3/changes.html
saxon:character-representation gives the preferred representation for special characters. For method="xml" the values are "hex" or "decimal" controlling whether character references should be in decimal or hexadecimal notation: this applies only to characters outside the selected encoding. For method="html" two values may be given, separated by a semicolon. The first gives the representation for non-ASCII characters that are present in the target character set: the values are "native", "entity", "decimal", or "hex". The second gives the representation for characters outside the selected encoding: the same values can be used, except for "native". For example if encoding="iso-8859-1", then saxon:character-representation="native;hex" causes characters in the range 0-255 to be written as themselves (except less-than, ampersand, etc which are always written as entity references, as is non-breaking-space), and causes characters outside this range to be written as hexadecimal character references. By contrast "entity;decimal" causes characters in the range 160-255 to be written using HTML-defined symbolic entities, and characters above 255 to be written in decimal. The default is "entity;decimal".
So by default, if this parameter is not explicitly set, its value is "entity;decimal", meaning that all non-ASCII chars will be encoded when using method="html". But the parameter can be changed directly in the XSLT stylesheet, like:
Code: Select all
<xsl:output method="html" saxon:character-representation="native" xmlns:saxon="http://saxon.sf.net/"/>
so using that in my simple test copy XSLT seemed to work.
But after I opened OXYGEN_INSTALL_DIR\frameworks\docbook\xsl\html\profile-docbook.xsl and replaced its xsl:output with:
Code: Select all
<xsl:output method="html" encoding="UTF-8" indent="no" saxon:character-representation="native" xmlns:saxon="http://saxon.sf.net/"/>
producing a single HTML file (not chunked) from the DocBook file still has the same problem. And I do not know why.
Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com
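One variation that may be worth trying is sketched below. It uses the two-value form described in the quoted change list and declares the extension namespace as http://icl.com/saxon. As far as I recall, that is the namespace the Saxon 6.5.x documentation uses, while http://saxon.sf.net/ belongs to later Saxon releases. Treat both points as assumptions to verify, not as a confirmed explanation of the behaviour reported above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Test stylesheet: the same identity copy as above, with two changes that
     are assumptions to verify against the Saxon 6.5.x documentation:
     1. the saxon prefix is bound to http://icl.com/saxon
     2. the two-value "native;decimal" form is used for method="html" -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:saxon="http://icl.com/saxon">

  <!-- "native" for characters the encoding can represent,
       "decimal" references for anything outside it. -->
  <xsl:output method="html" encoding="UTF-8" indent="no"
              saxon:character-representation="native;decimal"/>

  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

If this test stylesheet writes the Cyrillic text unescaped with Saxon 6.5.5, the same attribute and namespace could then be tried in the xsl:output of a DocBook customization layer.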