sorin wrote:Gandalf wrote:The issue is clearly acknowledged in XML as xsl:output has exactly this rule when method=html
I am not sure which rule of xsl:output/@method="html" you are talking about. Can you expand on this, please?
It comes from, at least, the XSLT 2.0 serialization specification:
XSLT 2.0 Spec wrote:7.4.3 HTML Output Method: the indent Parameter
If the indent parameter has the value yes, then the HTML output method MAY add or remove whitespace as it serializes the result tree, so long as it does not change the way that a conforming HTML user agent would render the output.
Note:
This rule can be satisfied by observing the following constraints:
Whitespace MUST NOT be added other than before or after an element, or adjacent to an existing whitespace character.
Whitespace MUST NOT be added or removed adjacent to an inline element. The inline elements are those included in the %inline category of any of the HTML 4.01 DTD's, as well as the ins and del elements if they are used as inline elements (i.e., if they do not contain element children).
[The same thing is said for XHTML.]
My concern is to be able to Format & Indent without breaking the (X)HTML rules but with the added problem that what might eventually be an inline (X)HTML tag is currently an XML tag. I.e. I have "inline" XML tags that need to be treated just like inline (X)HTML ones would be.
The xml:space attribute applies to the content of a tag, not the space before/after it. So without some way of declaring a tag inline, I need to look at the tags containing the "inline" tags... As the previous answer pointed out, the example I gave was element-only content, as it contained only nodes and whitespace - no non-whitespace text.
I'm surmising that all the places where Format & Indent is breaking the inline XML are such cases, so it is a matter of locating each one and placing an xml:space on the containing node. Doing it globally for the whole document, or globally for all instances of a particular non-inline node, would rather make F&I pointless...
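For what it's worth, that search could probably be automated rather than done by eye. The following is only a rough sketch in Python, assuming the surmise above is right (the breakage only occurs in element-only content). INLINE_TAGS and the sample tag names doc, para, formula and section are invented for illustration; you would substitute the names from your own vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical list: the tag names you treat as "inline" in your own XML.
INLINE_TAGS = {"b", "em", "math-var", "power", "op"}

def is_blank(text):
    """True for None, empty, or whitespace-only text."""
    return text is None or text.strip() == ""

def xml_space_candidates(root):
    """Return elements that contain only elements and whitespace
    and have at least one child whose tag is in INLINE_TAGS."""
    hits = []
    for elem in root.iter():
        children = list(elem)
        if not children:
            continue
        element_only = is_blank(elem.text) and all(is_blank(c.tail) for c in children)
        has_inline = any(c.tag in INLINE_TAGS for c in children)
        if element_only and has_inline:
            hits.append(elem)
    return hits

# Invented sample: only <formula> is element-only AND holds inline tags.
SAMPLE = """<doc>
  <para>Text with <b>bold</b> inside.</para>
  <formula><math-var>m</math-var><power>3</power></formula>
  <section><para>plain</para></section>
</doc>"""

for elem in xml_space_candidates(ET.fromstring(SAMPLE)):
    print(elem.tag)
```

Each element it reports is a place where an xml:space attribute on the containing node might be needed; whether it really is needed still has to be judged case by case.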
sorin wrote:Gandalf wrote:The problem in this case is that I'm starting with 700 pages of supplied XML. I am probably missing something, but the only way I'm discovering the added whitespace is by doing Acrobat diffs on the before and after 700-page PDFs.
I think there is a problem with the separation of concerns in your workflow. Adding or removing whitespace in the XML source should not induce differences in the PDF result as the XML source should store only the content of the document decoupled from any formatting/presentation layout which should be stored separately in the XSLT stylesheet which produces the XSL-FO document, assuming this is how you generate the PDF result. You should be able to format and indent (pretty-print) the XML source freely for readability/easy editing purposes without unwanted side effects on the PDF result. If the PDF result is different after a pretty-print operation on the XML source I think the problem is in the XSLT stylesheet (or the step which formats the content of the XML source for output). As long as the canonical form of the XML source document is the same before and after the pretty-print operation there is nothing wrong with this operation.
The PDF is just a tool, not my target result, and is produced by converting the X(HT)ML displayed in a browser.
But you highlight what I'm obviously failing to understand here. If "Adding or removing whitespace in the XML source should not induce differences in the PDF result as the XML source should store only the content of the document decoupled from any formatting/presentation layout" - which makes sense - how do you specify in the (made up) XML:
<b>bold</b> <em>italic</em> <math-var>m</math-var><power>3</power>
that the lack of space between </math-var> and <power> is very significant? Stripping all the space is wrong; allowing space between every node is wrong. (The use of <power> is probably a bad example, as it is unlikely that you would ever want space before it; clearly some inline tags sometimes have space before/after them and other times not. The previous example I gave contains such a case - look at the "op".)
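One way to see that the distinction is at least recorded in the source: a parser keeps the text that follows each end tag, so "a space" and "no space" are different values in the tree. A small sketch using Python's standard xml.etree.ElementTree, which calls that text the element's tail. The fragment is the made-up one above, with the opening tag spelled math-var so that it is well-formed, and wrapped in a hypothetical <p> so that it parses as one document.

```python
import xml.etree.ElementTree as ET

# The made-up fragment, wrapped in a hypothetical <p> element.
FRAGMENT = "<p><b>bold</b> <em>italic</em> <math-var>m</math-var><power>3</power></p>"

def tails(parent):
    """Map each child tag to the text that follows its end tag ('' if none)."""
    return {child.tag: (child.tail or "") for child in parent}

for tag, tail in tails(ET.fromstring(FRAGMENT)).items():
    print(f"after </{tag}>: {tail!r}")
```

The entries for b and em come out as a single space, and the entry for math-var as an empty string. That empty string is the join that has to stay tight, and it is exactly what an indenter overwrites when it treats the parent as element-only content.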
Tag all text (including whitespace) and strip all space except in that tag? Should work but difficult when not starting from scratch.
None of the methods mentioned so far seems to me to provide an easy way, when starting with a pre-supplied 700-page document, to determine which non-space is significant and must be preserved.
I can't imagine this is an unknown issue, and there is probably a dead obvious answer, so obvious that I'm missing it and will kick myself when I realise.
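Short of a proper answer, comparing the XML itself before and after Format & Indent might be cheaper than diffing PDFs. The sketch below is an assumption-laden illustration, not a finished tool: it assumes both versions have identical element structure (only whitespace differs), and the helper names slots and introduced_whitespace are made up. It lists every position that was empty before formatting and holds only whitespace afterwards.

```python
import xml.etree.ElementTree as ET

def slots(elem, path=""):
    """Yield (label, text) for each gap between tags inside elem, recursively.
    Leaf elements are skipped: their text is content, not a gap."""
    here = f"{path}/{elem.tag}"
    if len(elem):
        yield (f"{here}: before first child", elem.text or "")
        for i, child in enumerate(elem, start=1):
            yield from slots(child, here)
            yield (f"{here}: after child {i} <{child.tag}>", child.tail or "")

def introduced_whitespace(before_xml, after_xml):
    """Labels of gaps that were empty before and are whitespace-only after.
    Assumes the two documents have the same element structure."""
    before = list(slots(ET.fromstring(before_xml)))
    after = list(slots(ET.fromstring(after_xml)))
    return [label for (label, old), (_, new) in zip(before, after)
            if old == "" and new != "" and new.strip() == ""]

# Made-up before/after pair; <p> is a hypothetical wrapper element.
BEFORE = "<p><b>bold</b> <em>italic</em> <math-var>m</math-var><power>3</power></p>"
AFTER = ("<p>\n  <b>bold</b>\n  <em>italic</em>\n"
         "  <math-var>m</math-var>\n  <power>3</power>\n</p>")

for label in introduced_whitespace(BEFORE, AFTER):
    print(label)
```

Gaps that already contained a space (after b and em here) are not reported, since changing one run of whitespace into another is normally harmless once whitespace is collapsed. The reported positions are the candidates for xml:space, or for fixing up by hand.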
Until I realise or someone enlightens me I'm off to do this the hard way, find each specific case and fix it up by hand with carefully placed xml:space attributes...