Preserving correct XHTML through Format and Indent?

Gandalf
Posts: 12
Joined: Sat Sep 02, 2006 7:18 am

Preserving correct XHTML through Format and Indent?

Post by Gandalf »

Given XHTML such as:

<p>Some text <b>B</b><i>old</i> blah.</p>

Format and Indent will then happily break the line between the </b> and the <i>, which creates an incorrect space. (Yes, you need longer code than this example to trigger it.)

How can you prevent this? Preserve space on <p> doesn't work, as then the <p> isn't broken at all. Preserve space on <b> or <i> doesn't work either, as the issue is not what they contain. What we're trying to say is "if a closing and an opening inline tag are adjacent, do not break between them".

Inserting a U+2060 word joiner (a zero-width character) sort of works, but Oxygen displays it as a space after an XSLT transform has removed the &#x2060; (at least on a Mac). It is also rather difficult to add these characters to imported (not written) XHTML.

Am I missing something obvious?

TIA
sorin_ristache
Posts: 4141
Joined: Fri Mar 28, 2003 2:12 pm

Re: Preserving correct XHTML through Format and Indent?

Post by sorin_ristache »

Hello,
Gandalf wrote:Format and Indent will then happily break the line between the </b> and the <i>, which creates an incorrect space. (Yes, you need longer code than this example to trigger it.)
I cannot reproduce the line break between </b> and <i>. I changed the line width option in Options -> Preferences -> Editor / Format -> Line width - Format and Indent and I still could not reproduce it. Please provide a complete sample which shows the problem.
Gandalf wrote:Oxygen displays this as a space after a XSLT transform has removed the &#x2060;
Do you mean that an XSLT transform removes the &#x2060; character and <oXygen/> displays a space character (&#x20;) in a position of the document where in fact there is no space character ?


Regards,
Sorin
Gandalf
Posts: 12
Joined: Sat Sep 02, 2006 7:18 am

Re: Preserving correct XHTML through Format and Indent?

Post by Gandalf »

sorin wrote:Hello,
Gandalf wrote:Format and Indent will then happily break the line between the </b> and the <i>, which creates an incorrect space. (Yes, you need longer code than this example to trigger it.)
I cannot reproduce the line break between </b> and <i>. I changed the line width option in Options -> Preferences -> Editor / Format -> Line width - Format and Indent and I still could not reproduce it. Please provide a complete sample which shows the problem.
I can't find the case I originally had (I'm editing a 700-page book), but here is another one. Start with:

<table><tr><td>
<span class="code-fragment">operator</span> <span class="op">op</span><span class="code-fragment">(x)</span>
</td></tr></table>

Format and indent gives:

<table>
<tr>
<td>
<span class="code-fragment">operator</span>
<span class="op">op</span>
<span class="code-fragment">(x)</span>
</td>
</tr>
</table>

Which is nice and neat :-) (it was indented in the original) but unfortunately has added whitespace between "op" and "(x)".
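A minimal sketch (Python, standard library only) of how the change can be detected without rendering anything. It assumes that collapsing whitespace runs to a single space is a fair approximation of how a browser lays out inline text; the helper name rendered_text is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The two snippets from this post: the original, and the Format and Indent result.
BEFORE = """<table><tr><td>
<span class="code-fragment">operator</span> <span class="op">op</span><span class="code-fragment">(x)</span>
</td></tr></table>"""

AFTER = """<table>
<tr>
<td>
<span class="code-fragment">operator</span>
<span class="op">op</span>
<span class="code-fragment">(x)</span>
</td>
</tr>
</table>"""


def rendered_text(xml_source: str) -> str:
    """Join all text nodes and collapse each whitespace run to one space
    (an approximation of normal inline rendering, not a real layout engine)."""
    root = ET.fromstring(xml_source)
    return " ".join("".join(root.itertext()).split())


print(rendered_text(BEFORE))  # operator op(x)
print(rendered_text(AFTER))   # operator op (x)
```

Comparing the two strings shows the extra space between "op" and "(x)" directly.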

The strange thing is that it doesn't always do it: there are large tracts of spans which stay together.
sorin wrote:
Gandalf wrote:Oxygen displays this as a space after a XSLT transform has removed the &#x2060;
Do you mean that an XSLT transform removes the &#x2060; character and <oXygen/> displays a space character (&#x20;) in a position of the document where in fact there is no space character ?
Sorry, I wasn't too clear. After the XSLT, the &#x2060; character reference is replaced by the literal Unicode word joiner (U+2060), which Oxygen displays as a gap. The right character is in the file, but you see gaps in Oxygen where there shouldn't be any. It would be great if Oxygen could display the word joiner as, say, a small, narrow, coloured vertical line; even showing &#x2060; would be better than a gap. Anyway, this is part of the reason I said using them isn't a good solution (the other part being the work of inserting them!).
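For reference, the character names can be checked with Python's standard unicodedata module. This is only a sketch to pin down which character is which; the helper name describe is made up for the illustration.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


print(describe("\u2060"))  # ('U+2060', 'WORD JOINER', 'Cf')
print(describe("\u200d"))  # ('U+200D', 'ZERO WIDTH JOINER', 'Cf')
```

Both are invisible format characters (category Cf), so an editor has to choose deliberately how, or whether, to draw them.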
george
Site Admin
Posts: 2095
Joined: Thu Jan 09, 2003 2:58 pm

Post by george »

Hi,

In this case the content of td is identified as element-only, and that is why it is indented. If you had some non-whitespace characters inside td, oXygen would not add any spaces between the two adjacent span elements. The solution in this case, however, is to add xml:space="preserve" on the td element, which will give you:

<table>
<tr>
<td xml:space="preserve">
<span class="code-fragment">operator</span> <span class="op">op</span><span class="code-fragment">(x)</span>
</td>
</tr>
</table>
Best Regards,
George
Gandalf
Posts: 12
Joined: Sat Sep 02, 2006 7:18 am

Post by Gandalf »

george wrote:In this case the content of td is identified as element only and that is why it is indented. Suppose you have some non whitespace characters inside td, then oXygen will not add any spaces between the two adjacent span elements. The solution in this case however is to add xml:space="preserve" on the td element
The problem in this case is that I'm starting with 700 pages of supplied XML. I am probably missing something, but the only way I'm discovering the added whitespace is by doing Acrobat diffs on the before and after 700-page PDFs.

Is there no way to specify tags as containing inline content and that whitespace should never be added between such tags?

The issue is clearly acknowledged in XML as xsl:output has exactly this rule when method=html

In the meantime I'll manually insert xml:space on the cases I discover.

Thanks for the help.
sorin_ristache
Posts: 4141
Joined: Fri Mar 28, 2003 2:12 pm

Post by sorin_ristache »

Gandalf wrote:The problem in this case is I'm starting with 700 pages of supplied XML, I am probably missing something but the only way I'm discovering the added whitespace is by doing Acrobat diffs on the before and after 700 page PDF's.
I think there is a problem with the separation of concerns in your workflow. Adding or removing whitespace in the XML source should not induce differences in the PDF result as the XML source should store only the content of the document decoupled from any formatting/presentation layout which should be stored separately in the XSLT stylesheet which produces the XSL-FO document, assuming this is how you generate the PDF result. You should be able to format and indent (pretty-print) the XML source freely for readability/easy editing purposes without unwanted side effects on the PDF result. If the PDF result is different after a pretty-print operation on the XML source I think the problem is in the XSLT stylesheet (or the step which formats the content of the XML source for output). As long as the canonical form of the XML source document is the same before and after the pretty-print operation there is nothing wrong with this operation.
Gandalf wrote:Is there no way to specify tags as containing inline content and that whitespace should never be added between such tags?
...
In the meantime I'll manually insert xml:space on the cases I discover.
Other than inserting xml:space="preserve" multiple times in elements with the same name of the XML source you can:

- add the element name only once to the Preserve space elements list in Options -> Preferences -> Editor / Format / XML -> Preserve space elements;

- add the xml:space attribute with the default or fixed value "preserve" in the schema of the XML source.
Gandalf wrote:The issue is clearly acknowledged in XML as xsl:output has exactly this rule when method=html
I am not sure what is the rule of xsl:output/@method="html" that you are talking about. Can you expand on this please ?


Regards,
Sorin
Gandalf
Posts: 12
Joined: Sat Sep 02, 2006 7:18 am

Post by Gandalf »

sorin wrote:
Gandalf wrote:The issue is clearly acknowledged in XML as xsl:output has exactly this rule when method=html
I am not sure what is the rule of xsl:output/@method="html" that you are talking about. Can you expand on this please ?
This comes at least from the XSLT 2.0 Serialisation:
XSLT 2.0 Spec wrote:7.4.3 HTML Output Method: the indent Parameter

If the indent parameter has the value yes, then the HTML output method MAY add or remove whitespace as it serializes the result tree, so long as it does not change the way that a conforming HTML user agent would render the output.

Note:

This rule can be satisfied by observing the following constraints:

Whitespace MUST NOT be added other than before or after an element, or adjacent to an existing whitespace character.

Whitespace MUST NOT be added or removed adjacent to an inline element. The inline elements are those included in the %inline category of any of the HTML 4.01 DTD's, as well as the ins and del elements if they are used as inline elements (i.e., if they do not contain element children).
[The same thing is said for XHTML.]

My concern is to be able to Format & Indent without breaking the (X)HTML rules but with the added problem that what might eventually be an inline (X)HTML tag is currently an XML tag. I.e. I have "inline" XML tags that need to be treated just like inline (X)HTML ones would be.

The xml:space attribute applies to the content of a tag, not to the space before or after it. So without some way of declaring a tag inline, I need to look at the tags containing the "inline" tags... As the previous answer pointed out, the example I gave was element-only content, as it contained only nodes and whitespace, with no non-whitespace text.

I'm surmising that all the places where Format & Indent is breaking the inline XML are such cases, so it is a matter of locating each one and placing an xml:space attribute on the containing node. Doing it globally for the whole document, or globally for all instances of a particular non-inline node, would rather make F&I pointless...
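Locating the candidates could be scripted rather than done by eye. The following is a rough sketch in Python (standard library only), assuming the trigger is exactly the one described earlier in the thread: a parent with no non-whitespace text of its own, containing two child elements that touch with no whitespace between them. The function name at_risk_elements is made up, and the check ignores nested cases and trailing punctuation.

```python
import xml.etree.ElementTree as ET


def at_risk_elements(root):
    """Yield elements that look element-only (no non-whitespace text of their
    own) but have two child elements touching with no whitespace between."""
    for elem in root.iter():
        children = list(elem)
        if len(children) < 2:
            continue
        own_text = [elem.text] + [child.tail for child in children]
        if any(t and t.strip() for t in own_text):
            continue  # real text present: mixed content, not the case at hand
        # A child with an empty tail is immediately followed by its sibling.
        if any(not child.tail for child in children[:-1]):
            yield elem


SAMPLE = """<table><tr><td>
<span class="code-fragment">operator</span> <span class="op">op</span><span class="code-fragment">(x)</span>
</td></tr></table>"""

print([e.tag for e in at_risk_elements(ET.fromstring(SAMPLE))])  # ['td']
```

Each element reported is a place where an xml:space attribute might be needed before running Format & Indent.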
sorin wrote:
Gandalf wrote:The problem in this case is I'm starting with 700 pages of supplied XML, I am probably missing something but the only way I'm discovering the added whitespace is by doing Acrobat diffs on the before and after 700 page PDF's.
I think there is a problem with the separation of concerns in your workflow. Adding or removing whitespace in the XML source should not induce differences in the PDF result as the XML source should store only the content of the document decoupled from any formatting/presentation layout which should be stored separately in the XSLT stylesheet which produces the XSL-FO document, assuming this is how you generate the PDF result. You should be able to format and indent (pretty-print) the XML source freely for readability/easy editing purposes without unwanted side effects on the PDF result. If the PDF result is different after a pretty-print operation on the XML source I think the problem is in the XSLT stylesheet (or the step which formats the content of the XML source for output). As long as the canonical form of the XML source document is the same before and after the pretty-print operation there is nothing wrong with this operation.
The PDF is just a tool not my target result and is produced by converting the X(HT)ML displayed in a browser.

But you highlight what I'm obviously failing to understand here. If "Adding or removing whitespace in the XML source should not induce differences in the PDF result as the XML source should store only the content of the document decoupled from any formatting/presentation layout" - which makes sense - how do you specify in the (made up) XML:
<b>bold</b> <em>italic</em> <math-var>m</math-var><power>3</power>
that the lack of space between </math-var> & <power> is very significant. Stripping all the space is wrong, allowing space between every node is wrong. (The use of <power> is probably bad as it is unlikely that you ever want space before it, clearly some inline tags sometimes have space before/after them and other times not. The previous example I gave contains such a case - look at the "op".)

Tag all text (including whitespace) and strip all space except in that tag? Should work but difficult when not starting from scratch.

None of the methods mentioned so far appears to me to provide an easy way, when starting with a pre-supplied 700-page document, to determine which absence of space is significant and must be preserved.

I can't imagine this is an unknown issue, and there is probably a dead obvious answer, so obvious I'm missing it and will kick myself when I realise :oops:

Until I realise or someone enlightens me I'm off to do this the hard way, find each specific case and fix it up by hand with carefully placed xml:space attributes...
george
Site Admin
Posts: 2095
Joined: Thu Jan 09, 2003 2:58 pm

Post by george »

Hi,

A couple of issues:
1. oXygen Format and Indent is generic for any XML document; there is nothing specific to XHTML.
2. The problem is that it is not possible to identify that an element can contain mixed content when it contains only elements. In that case oXygen considers the element to contain only elements and indents them. A possible solution will be to allow the user to specify not only the preserve space elements and the strip space elements but also the elements that have mixed content. Eventually if we identify a schema or a DTD we may extract this information from there. That will allow you to specify that td for instance contains mixed content and then the indentation will be done only on whitespaces.

Another possibility will be to have one option to force the indentation only on whitespace, that is
<a><x/><y/></a>
will remain like that
while
<a> <x/> <y/> </a>
will be indented.
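To make the second option concrete, here is a minimal sketch in Python (standard library only). It is not how oXygen works internally; it only illustrates the rule "indent only where whitespace already exists". The function name indent_on_whitespace and the two-space indent unit are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET


def indent_on_whitespace(elem, level=0, unit="  "):
    """Re-indent in place, but only where whitespace-only text already exists.
    Boundaries with no whitespace, or with real text, are left untouched."""
    children = list(elem)
    if not children:
        return
    inner = "\n" + unit * (level + 1)  # before a child element
    outer = "\n" + unit * level        # before the parent's closing tag

    def is_blank(text):
        return bool(text) and not text.strip()

    if is_blank(elem.text):
        elem.text = inner
    for i, child in enumerate(children):
        indent_on_whitespace(child, level + 1, unit)
        if is_blank(child.tail):
            child.tail = outer if i == len(children) - 1 else inner


tight = ET.fromstring("<a><x/><y/></a>")
loose = ET.fromstring("<a> <x/> <y/> </a>")
indent_on_whitespace(tight)
indent_on_whitespace(loose)
print(ET.tostring(tight, encoding="unicode"))  # stays on one line
print(ET.tostring(loose, encoding="unicode"))  # one child per line
```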

Any thoughts on the above options?

Best Regards,
George
Gandalf
Posts: 12
Joined: Sat Sep 02, 2006 7:18 am

Post by Gandalf »

george wrote:The problem is that it is not possible to identify that an element can contain mixed content when it contains only elements.
Mixed content doesn't cover:
2<power><math>e</math></power>
This clearly is mixed (the above is part of a larger paragraph). Another case, which just bit me:
blah blah <code-fragment><emphasis>blah</emphasis></code-fragment>.
- the final full stop got detached :-(

While you can deal with those two with xml:space they are just another case of inline tags - if inline tags are consecutive or nested no spaces should be added before/after/within.
george wrote:In that case oXygen considers the element to contain only elements and indents them. A possible solution will be to allow the user to specify not only the preserve space elements and the strip space elements but also the elements that have mixed content. Eventually if we identify a schema or a DTD we may extract this information from there. That will allow you to specify that td for instance contains mixed content and then the indentation will be done only on whitespaces.

Another possibility will be to have one option to force the indentation only on whitespace, that is
<a><x/><y/></a>
will remain like that
while
<a> <x/> <y/> </a>
will be indented.

Any thoughts on the above options?
OK, given I'm relatively new to XSLT...

The XHTML/HTML xsl:output serialisation is governed by the %inline in the DTD if I read it correctly. That might make sense if you read the DTD.

For the second option would it cope with:
2<power>1 + <math>e</math></power>
?

I'll suggest a third option, the F&I is based on two lists for strip space and preserve space, why not add a third for inline tags?

Thanks for all the help, looks like by a manual process I'm about to solve this problem for my 700-page document.
george
Site Admin
Posts: 2095
Joined: Thu Jan 09, 2003 2:58 pm

Post by george »

If the power element is treated as having mixed content, then the math element will not be indented: in mixed content, indenting is done only where there is already whitespace.

Thanks. The list of inline elements is more or less equivalent to the list of elements that can have mixed content. The difference is in the lookup: with an inline list we have to look at the child elements, and if one of them is an inline element we consider the current element to have mixed content; with a mixed-content list we just look up the current element in that list.

I'm not sure which list is easier to specify from a user's perspective. I would go for the mixed elements list, as that is more closely linked to the schema or DTD: if an element can contain both text and elements, then it has mixed content. Inline elements are just a notation found in a couple of existing DTDs/schemas, and there is no guarantee that an element that appears in mixed content in element1 cannot appear in element-only content in element2.
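The two lookups can be sketched side by side in Python (standard library only). This is an illustration of the difference described above, not oXygen code; both lists and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-supplied lists; the element names are illustrative only.
MIXED_ELEMENTS = {"td", "p", "power"}
INLINE_ELEMENTS = {"span", "b", "i", "math", "power"}


def mixed_by_list(elem):
    """Mixed-content list: look up the current element itself."""
    return elem.tag in MIXED_ELEMENTS


def mixed_by_inline_children(elem):
    """Inline list: inspect the children; any inline child marks the parent as mixed."""
    return any(child.tag in INLINE_ELEMENTS for child in elem)


td = ET.fromstring("<td><span>op</span><span>(x)</span></td>")
tr = ET.fromstring("<tr><td/></tr>")
print(mixed_by_list(td), mixed_by_inline_children(td))  # True True
print(mixed_by_list(tr), mixed_by_inline_children(tr))  # False False
```

With either list, td would be treated as mixed content and the adjacent spans would stay together.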

Best Regards,
George