Preserving correct XHTML through Format and Indent?
-
- Posts: 12
- Joined: Sat Sep 02, 2006 7:18 am
Given XHTML such as:
<p>Some text <b>B</b><i>old</i> blah.</p>
Then format and indent will happily break the line between the </b> and <i>, which creates an incorrect space. (Yes, you need longer code than the example to trigger this.)
How can you prevent this? Preserve space on <p> doesn't work as then the <p> isn't broken at all. Preserve space on <b> or <i> doesn't work as it is not what they contain which is the issue. What we're trying to say is "if a closing and opening inline tag are adjacent do not break between them".
Inserting a U+2060 zero width word joiner sort of works, but Oxygen displays this as a space after an XSLT transform has removed the &#x2060; character reference (at least on a Mac). It is also rather difficult to add these characters to imported (not written) XHTML.
Am I missing something obvious?
TIA
-
- Posts: 4141
- Joined: Fri Mar 28, 2003 2:12 pm
Re: Preserving correct XHTML through Format and Indent?
Post by sorin_ristache »
Hello,
Gandalf wrote: Then format and indent will happily break the line between the </b> and <i> which creates and incorrect space. (Yes you need longer code than the example to trigger this.)
I cannot reproduce the line break between </b> and <i>. I changed the line width option in Options -> Preferences -> Editor / Format -> Line width - Format and Indent and I still could not reproduce it. Please provide a complete sample which shows the problem.
Gandalf wrote: Oxygen displays this as a space after a XSLT transform has removed the &#x2060;
Do you mean that an XSLT transform removes the &#x2060; character and <oXygen/> displays a space character ( ) in a position of the document where in fact there is no space character?
Regards,
Sorin
-
- Posts: 12
- Joined: Sat Sep 02, 2006 7:18 am
Re: Preserving correct XHTML through Format and Indent?
sorin wrote: I cannot reproduce the line break between </b> and <i>. I changed the line width option in Options -> Preferences -> Editor / Format -> Line width - Format and Indent and I still could not reproduce it. Please provide a complete sample which shows the problem.
I can't find the case I originally had (I'm editing a 700-page book), but here is one. Start with:
<table><tr><td>
<span class="code-fragment">operator</span> <span class="op">op</span><span class="code-fragment">(x)</span>
</td></tr></table>
Format and indent gives:
<table>
  <tr>
    <td>
      <span class="code-fragment">operator</span>
      <span class="op">op</span>
      <span class="code-fragment">(x)</span>
    </td>
  </tr>
</table>
Which is nice and neat, but the line break between the op and (x) spans adds whitespace that was not in the source.
The strange thing is that it doesn't always do it: there are large tracts of spans which stay together.
sorin wrote: Do you mean that an XSLT transform removes the &#x2060; character and <oXygen/> displays a space character ( ) in a position of the document where in fact there is no space character?
Sorry, I wasn't too clear. After the XSLT, the &#x2060; gets replaced by the actual Unicode word joiner character (U+2060), which Oxygen displays as a gap. The right character is in the file, but you see gaps in Oxygen where there shouldn't be any. It would be great if Oxygen could display the word joiner as, say, a small narrow vertical coloured line, but even a visible &#x2060; has to be better than a gap. Anyway, this is part of the reason I said using them isn't a good solution (the other part being inserting them!).
-
- Site Admin
- Posts: 2095
- Joined: Thu Jan 09, 2003 2:58 pm
Hi,
In this case the content of td is identified as element only and that is why it is indented. If there were some non-whitespace characters inside td, oXygen would not add any spaces between the two adjacent span elements. The solution in this case, however, is to add xml:space="preserve" on the td element, which will give you:
<table>
<tr>
<td xml:space="preserve">
<span class="code-fragment">operator</span> <span class="op">op</span><span class="code-fragment">(x)</span>
</td>
</tr>
</table>
Best Regards,
George
-
- Posts: 12
- Joined: Sat Sep 02, 2006 7:18 am
george wrote: In this case the content of td is identified as element only and that is why it is indented. Suppose you have some non whitespace characters inside td, then oXygen will not add any spaces between the two adiacent span elements. The solution in this case however is to add xml:space="preserve" on the td element
The problem in this case is that I'm starting with 700 pages of supplied XML. I am probably missing something, but the only way I'm discovering the added whitespace is by doing Acrobat diffs on the before and after 700-page PDFs.
Is there no way to specify tags as containing inline content and that whitespace should never be added between such tags?
The issue is clearly acknowledged in XML, as xsl:output has exactly this rule when method=html.
In the meantime I'll manually insert xml:space on the cases I discover.
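As a rough alternative to diffing PDFs, the candidate spots can be listed by scanning the source for element-only parents whose child tags touch with no whitespace between them. The sketch below is only a heuristic built on Python's standard xml.etree.ElementTree; the name find_at_risk is made up for illustration, it is not an Oxygen feature, and it does not cover a single nested child such as <code-fragment><emphasis>.

```python
import xml.etree.ElementTree as ET


def is_blank(text):
    """True when text is None, empty, or whitespace only."""
    return text is None or text.strip() == ""


def find_at_risk(root):
    """Yield (parent_tag, left_tag, right_tag) for every pair of adjacent
    children that touch inside a parent holding no non-whitespace text."""
    for parent in root.iter():
        children = list(parent)
        if len(children) < 2:
            continue
        # Element-only content: nothing but whitespace directly inside the parent.
        element_only = is_blank(parent.text) and all(
            is_blank(child.tail) for child in children
        )
        if not element_only:
            continue
        for left, right in zip(children, children[1:]):
            if not left.tail:  # None or "" means the two tags are adjacent
                yield parent.tag, left.tag, right.tag


sample = """<table><tr><td>
<span class="code-fragment">operator</span> <span class="op">op</span><span class="code-fragment">(x)</span>
</td></tr></table>"""

hits = list(find_at_risk(ET.fromstring(sample)))
for hit in hits:
    print(hit)
```

For the table sample from earlier in the thread this reports the td whose second and third span touch. Block-level siblings that touch (two adjacent td elements, say) would be reported as well, so the output is a list to review rather than a list of confirmed problems.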
Thanks for the help.
-
- Posts: 4141
- Joined: Fri Mar 28, 2003 2:12 pm
Post by sorin_ristache »
Gandalf wrote: The problem in this case is I'm starting with 700 pages of supplied XML, I am probably missing something but the only way I'm discovering the added whitespace is by doing Acrobat diffs on the before and after 700 page PDF's.
I think there is a problem with the separation of concerns in your workflow. Adding or removing whitespace in the XML source should not induce differences in the PDF result, as the XML source should store only the content of the document, decoupled from any formatting/presentation layout. The layout should be stored separately in the XSLT stylesheet which produces the XSL-FO document, assuming this is how you generate the PDF result. You should be able to format and indent (pretty-print) the XML source freely for readability and easy editing without unwanted side effects on the PDF result. If the PDF result is different after a pretty-print operation on the XML source, I think the problem is in the XSLT stylesheet (or the step which formats the content of the XML source for output). As long as the canonical form of the XML source document is the same before and after the pretty-print operation, there is nothing wrong with this operation.
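The canonical-form condition can be checked mechanically. A minimal sketch, assuming Python 3.8 or later, where xml.etree.ElementTree.canonicalize produces C14N 2.0 output; the sample strings are taken from the first post and the variable names are illustrative only.

```python
from xml.etree.ElementTree import canonicalize

# The paragraph from the first post, before and after a line break
# is inserted between </b> and <i>.
before = "<p>Some text <b>B</b><i>old</i> blah.</p>"
after = "<p>Some text <b>B</b>\n<i>old</i> blah.</p>"
same_after_break = canonicalize(before) == canonicalize(after)

# Differences that live purely in the markup (spacing inside a tag,
# empty-element syntax) disappear in canonical form.
compact = '<td class="x"/>'
spaced = '<td   class="x" ></td>'
same_markup_only = canonicalize(compact) == canonicalize(spaced)

print(same_after_break, same_markup_only)
```

With the default options, canonicalization keeps character data as it is, so whitespace added between two elements counts as a content change and the first comparison comes out unequal, while the second pair compares equal.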
Gandalf wrote: Is there no way to specify tags as containing inline content and that whitespace should never be added between such tags?
...
In the meantime I'll manually insert xml:space on the cases I discover.
Other than inserting xml:space="preserve" multiple times on elements with the same name in the XML source, you can:
- add the element name only once to the Preserve space elements list in Options -> Preferences -> Editor / Format / XML -> Preserve space elements;
- add the xml:space attribute with the default or fixed value "preserve" in the schema of the XML source.
Gandalf wrote: The issue is clearly acknowledged in XML as xsl:output has exactly this rule when method=html
I am not sure which rule of xsl:output/@method="html" you are talking about. Can you expand on this, please?
Regards,
Sorin
-
- Posts: 12
- Joined: Sat Sep 02, 2006 7:18 am
sorin wrote: I am not sure what is the rule of xsl:output/@method="html" that you are talking about. Can you expand on this please ?
This comes at least from the XSLT 2.0 Serialisation spec (the same thing is said for XHTML):
XSLT 2.0 Spec wrote: 7.4.3 HTML Output Method: the indent Parameter
If the indent parameter has the value yes, then the HTML output method MAY add or remove whitespace as it serializes the result tree, so long as it does not change the way that a conforming HTML user agent would render the output.
Note:
This rule can be satisfied by observing the following constraints:
Whitespace MUST NOT be added other than before or after an element, or adjacent to an existing whitespace character.
Whitespace MUST NOT be added or removed adjacent to an inline element. The inline elements are those included in the %inline category of any of the HTML 4.01 DTD's, as well as the ins and del elements if they are used as inline elements (i.e., if they do not contain element children).
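The second constraint can be read as a simple test on the two tags either side of a boundary. The sketch below illustrates that reading only: INLINE is a partial list taken from the HTML 4.01 %inline category, and may_add_whitespace is a hypothetical helper, not something defined by the spec or by Oxygen.

```python
# Partial list of elements in the HTML 4.01 %inline category (illustrative subset).
INLINE = frozenset({
    "a", "abbr", "b", "big", "cite", "code", "em", "i", "kbd", "q",
    "samp", "small", "span", "strong", "sub", "sup", "tt", "var",
})


def may_add_whitespace(left_tag, right_tag, inline=INLINE):
    """Return True if whitespace may be added at the boundary between two
    neighbouring tags, i.e. neither side is an inline element."""
    return left_tag not in inline and right_tag not in inline


print(may_add_whitespace("b", "i"))    # boundary between </b> and <i>
print(may_add_whitespace("tr", "tr"))  # boundary between two table rows
```

Under this reading the boundary between </b> and <i> is off limits, while the boundary between two tr elements is free to take a line break.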
My concern is to be able to Format & Indent without breaking the (X)HTML rules but with the added problem that what might eventually be an inline (X)HTML tag is currently an XML tag. I.e. I have "inline" XML tags that need to be treated just like inline (X)HTML ones would be.
The xml:space attribute applies to the content of a tag, not the space before/after it. So without some way of declaring a tag inline, I need to look at the tags containing the "inline" tags... As the previous answer pointed out, the example I gave was element-only content, as it only contained nodes and whitespace with no non-whitespace text.
I'm surmising that all the places where Format & Indent is breaking the inline XML are such cases, so it is a matter of locating each one and placing an xml:space on the containing node. Doing it globally for the whole document, or globally for all instances of a particular non-inline node, would rather make F&I pointless...
sorin wrote: I think there is a problem with the separation of concerns in your workflow. Adding or removing whitespace in the XML source should not induce differences in the PDF result as the XML source should store only the content of the document decoupled from any formatting/presentation layout which should be stored separately in the XSLT stylesheet which produces the XSL-FO document, assuming this is how you generate the PDF result. You should be able to format and indent (pretty-print) the XML source freely for readability/easy editing purposes without unwanted side effects on the PDF result. If the PDF result is different after a pretty-print operation on the XML source I think the problem is in the XSLT stylesheet (or the step which formats the content of the XML source for output). As long as the canonical form of the XML source document is the same before and after the pretty-print operation there is nothing wrong with this operation.
The PDF is just a tool, not my target result, and is produced by converting the X(HT)ML displayed in a browser.
But you highlight what I'm obviously failing to understand here. If "Adding or removing whitespace in the XML source should not induce differences in the PDF result as the XML source should store only the content of the document decoupled from any formatting/presentation layout" - which makes sense - how do you specify in the (made up) XML:
<b>bold</b> <em>italic</em> <math-var>m</math-var><power>3</power>
that the lack of space between </math-var> and <power> is very significant? Stripping all the space is wrong; allowing space between every node is wrong. (The use of <power> is probably a bad example, as it is unlikely that you ever want space before it. Clearly some inline tags sometimes have space before/after them and other times not. The previous example I gave contains such a case - look at the "op".)
Tag all text (including whitespace) and strip all space except in that tag? Should work but difficult when not starting from scratch.
None of the methods mentioned so far appears to me to provide an easy way, when starting with a pre-supplied 700-page document, to determine which absence of space is significant and must be preserved.
I can't imagine this is an unknown issue, and there is probably a dead obvious answer, so obvious I'm missing it and will kick myself when I realise.

Until I realise or someone enlightens me I'm off to do this the hard way, find each specific case and fix it up by hand with carefully placed xml:space attributes...
-
- Site Admin
- Posts: 2095
- Joined: Thu Jan 09, 2003 2:58 pm
Hi,
A couple of issues:
1. oXygen format and indent is generic for any XML document; there is nothing specific to XHTML.
2. The problem is that it is not possible to identify that an element can contain mixed content when it contains only elements. In that case oXygen considers the element to contain only elements and indents them. A possible solution would be to allow the user to specify not only the preserve space elements and the strip space elements but also the elements that have mixed content. Eventually, if we identify a schema or a DTD, we may extract this information from there. That would allow you to specify that td, for instance, contains mixed content, and then the indentation would be done only on existing whitespace.
Another possibility will be to have one option to force the indentation only on whitespace, that is
<a><x/><y/></a>
will remain like that
while
<a> <x/> <y/> </a>
will be indented.
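The "indent only on whitespace" idea can be sketched in a few lines. This is an illustration of the proposed rule using Python's standard xml.etree.ElementTree, not oXygen's implementation; indent_on_whitespace and fmt are hypothetical names, and ElementTree writes empty elements as <x /> rather than <x/>.

```python
import xml.etree.ElementTree as ET


def indent_on_whitespace(elem, level=0, unit="  "):
    """Re-indent children only at boundaries that already hold whitespace.
    Boundaries where two tags touch, or that carry text, are left alone."""
    children = list(elem)
    if not children:
        return
    inner = "\n" + unit * (level + 1)  # indentation for a child
    outer = "\n" + unit * level        # indentation for the closing tag
    # Boundary between the start tag and the first child.
    if elem.text and not elem.text.strip():
        elem.text = inner
    for index, child in enumerate(children):
        indent_on_whitespace(child, level + 1, unit)
        # Boundary after this child: whitespace-only tails get normalised.
        if child.tail and not child.tail.strip():
            is_last = index == len(children) - 1
            child.tail = outer if is_last else inner


def fmt(xml):
    """Parse, apply the whitespace-only indenter, and serialise again."""
    root = ET.fromstring(xml)
    indent_on_whitespace(root)
    return ET.tostring(root, encoding="unicode")


print(fmt("<a><x/><y/></a>"))
print(fmt("<a> <x/> <y/> </a>"))
```

The first document stays on one line because no boundary contains whitespace; the second is spread over several lines because every boundary already had some.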
Any thoughts on the above options?
Best Regards,
George
-
- Posts: 12
- Joined: Sat Sep 02, 2006 7:18 am
george wrote: The problem is that it is not possible to identify that an element can contain mixed content when it contains only elements.
Mixed content doesn't cover:
2<power><math>e</math></power>
which clearly is mixed (the above is part of a larger paragraph). Another which just bit me:
blah blah <code-fragment><emphasis>blah</emphasis></code-fragment>.
- the final full stop got detached.

While you can deal with those two with xml:space they are just another case of inline tags - if inline tags are consecutive or nested no spaces should be added before/after/within.
george wrote: In that case oXygen considers the element to contain only elements and indents them. A possible solution will be to allow the user to specify not only the preserve space elements and the strip space elements but also the elements that have mixed content. Eventually if we identify a schema or a DTD we may extract this information from there. That will allow you to specify that td for instance contains mixed content and then the indentation will be done only on whitespaces.
Another possibility will be to have one option to force the indentation only on whitespace, that is
<a><x/><y/></a>
will remain like that
while
<a> <x/> <y/> </a>
will be indented.
Any thoughts on the above options?
OK, given I'm relatively new to XSLT...
The XHTML/HTML xsl:output serialisation is governed by the %inline in the DTD if I read it correctly. That might make sense if you read the DTD.
For the second option, would it cope with the following?
2<power>1 + <math>e</math></power>
I'll suggest a third option, the F&I is based on two lists for strip space and preserve space, why not add a third for inline tags?
Thanks for all the help, looks like by a manual process I'm about to solve this problem for my 700-page document.
-
- Site Admin
- Posts: 2095
- Joined: Thu Jan 09, 2003 2:58 pm
If the power element is treated as having mixed content, then the math element will not be indented: in mixed content, indenting is done only where there is already whitespace.
Thanks. The list of inline elements is more or less equivalent to the list of elements that can have mixed content. The difference is that with an inline list we have to look at the child elements and, if one of them is inline, consider the current element to have mixed content, whereas with a list of mixed-content elements we just look up the current element in that list.
I'm not sure which list is easier to specify from a user perspective. I would go for the mixed elements list, as that is more closely linked with the schema or DTD: if an element can contain both text and elements, it has mixed content. Inline elements are just a notation found in a couple of existing DTDs/schemas, and there is no guarantee that an element that appears in mixed content in element1 cannot appear in element-only content in element2.
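The two lookups compared above can be put side by side. A minimal sketch, assuming both lists are supplied by the user; the list contents and function names are made up for illustration and do not correspond to an existing oXygen option.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-supplied lists.
INLINE_ELEMENTS = {"span", "b", "i", "power", "math"}
MIXED_ELEMENTS = {"p", "td", "power"}


def mixed_by_inline_children(elem, inline=INLINE_ELEMENTS):
    """Inline-list approach: the element counts as mixed content
    if any of its children is a declared inline element."""
    return any(child.tag in inline for child in elem)


def mixed_by_list(elem, mixed=MIXED_ELEMENTS):
    """Mixed-list approach: look the element itself up in the list."""
    return elem.tag in mixed


td = ET.fromstring("<td><span>operator</span> <span>op</span><span>(x)</span></td>")
tr = ET.fromstring("<tr><td>a</td><td>b</td></tr>")

for elem in (td, tr):
    print(elem.tag, mixed_by_inline_children(elem), mixed_by_list(elem))
```

With these lists both approaches treat td as mixed content and tr as element-only. They diverge only when the two lists disagree, for example when an element listed as inline also turns up as a child in element-only content.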
Best Regards,
George