
Re: [xsl] XML apparently cannot be used for general text markup: whitespace gripe


Subject: Re: [xsl] XML apparently cannot be used for general text markup: whitespace gripe
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Tue, 19 Mar 2002 11:57:16 -0500

Hi all,

At 09:06 AM 3/19/2002, Chad wrote:
 I've noticed a lot of xml-derived web pages out there have screwed up
whitespace (words crammed together or an incorrect space before ending
punctuation).

Or spurious whitespace within words, or ... And not only XML-derived pages, but many kinds of pages that apparently come out of automated production systems. (Don't blame XML: this issue predates it.)


 My conclusion is that blocks of straight text (such as paragraphs) cannot be
further marked up with XML without screwing up spacing.

I'd like to answer this at (even) more length, but time constraints prevent it. Still, it's an important issue.


As I see it there are really only two ways you can go with this problem. Either place responsibility for correct whitespace usage at the point of production of the XML, clearly distinguishing where whitespace is significant (must be preserved as given) and where it's not; or place responsibility at the point of processing, e.g. have the stylesheet designer plan for munging.

The first approach is, I believe, preferable where possible, and is in keeping with the whitespace-handling mechanisms provided in XML (such as they are). Correct placement of whitespace is then recognized as an authorial and editorial concern, much like correct spelling or grammar. In this scenario, you would just never have to deal with the input:

    <par>
      Is his name really <first>John</first>      <last>Doe</last>?
    </par>

but instead, would have

<par>Is his name really <first>John</first> <last>Doe</last>?</par>

It would be an editorial responsibility to make sure that content of your <par> elements would follow the rule here, that as far as whitespace is concerned, WYSIWYG -- so garbage in, garbage out. This is a purist approach, taking the line that if it's data, it's data, and that it's really too much to expect any lightweight text processor to have heuristics intelligent enough to know that, e.g., the initial whitespace appearing after the start tag but before the word "Is", doesn't count, but *one* of the spaces between the <first> and <last> elements does.
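
An editorial rule like this can at least be checked mechanically. What follows is only a minimal sketch, not a finished tool: it assumes that <par> (from the example above) is the only element whose whitespace matters, and it reports every <par> whose string value differs from its whitespace-normalized form. The message text is made up for illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- A par is "clean" when normalizing its string value changes nothing:
         no leading or trailing whitespace, no runs of spaces, no line breaks. -->
    <xsl:for-each select="//par[string(.) != normalize-space(.)]">
      <xsl:text>Unclean whitespace in par number </xsl:text>
      <!-- Position of this par among all par elements in the document -->
      <xsl:number level="any" count="par"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>
```

Note that this test is strict: it also flags paragraphs whose source lines have merely been wrapped, since a line break is whitespace that normalization would change.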

On the other hand, this approach is not always possible -- not least because in a system where whitespace cleanup was mandated editorially, you might be the person asked to write a routine to fix the inevitable whitespace problems before handing the data to the production staff, and you might want to have automated or semi-automated ways to do this. (Not all authors can be trusted; and what's worse, some XML systems introduce whitespace "for you", taking control away.)

So how to write the XSLT to do the cleanup? As I said, I can't specify it here in detail, but a general approach would be:

1. Normalize space on all text nodes
   (i.e. remove leading and trailing, collapse all internal whitespace to
   a single space character)
2. Use heuristics to add single #32 characters back in to pad where there
   should really be whitespace. Heuristics would include:
   2a. which elements are concerned
       e.g. add it back after whitespace normalization here:
           <first>John</first> <last>Doe</last>
       but not here:
           H<sub>2</sub>O
   2b. which neighboring characters are around
       e.g. don't add whitespace back before punctuation characters, as in
           <last>Doe</last>?</par>
3. Serialize and/or post-process using tools that will not introduce or remove whitespace, particularly not inside elements that contain any non-whitespace #PCDATA (or better, inside any element of any type that contains #PCDATA anywhere).
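
Steps 1 and 2 might be sketched in XSLT 1.0 roughly as follows. This is an illustration under stated assumptions, not a general solution: it assumes <par> is the mixed-content container and <book> holds only elements (as in the example quoted below), treats <sub> as the only "character-level" element, takes the presence of whitespace in the source as the evidence that a space may be wanted, and uses a short hard-coded punctuation list. A real document type would need its own lists.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>

  <!-- Assumption: book has element-only content, so its whitespace is just layout -->
  <xsl:strip-space elements="book"/>

  <xsl:template match="/">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>

  <xsl:template match="par">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <xsl:template match="sub">
    <sub><xsl:apply-templates/></sub>
  </xsl:template>

  <xsl:template match="par//text()">
    <xsl:variable name="par-id" select="generate-id(ancestor::par[1])"/>
    <!-- Step 1: normalize this text node -->
    <xsl:variable name="norm" select="normalize-space(.)"/>
    <!-- Text nodes earlier in the same par; as a variable the set is indexed
         in document order, so [last()] is the nearest one -->
    <xsl:variable name="earlier"
        select="preceding::text()[generate-id(ancestor::par[1]) = $par-id]"/>
    <xsl:variable name="prev" select="$earlier[last()]"/>
    <!-- Did the source have whitespace just before this node's first word? -->
    <xsl:variable name="had-space"
        select="normalize-space(substring(., 1, 1)) = ''
                or ($prev and
                    normalize-space(substring($prev, string-length($prev))) = '')"/>
    <!-- Whitespace-only nodes emit nothing themselves -->
    <xsl:if test="$norm != ''">
      <!-- Step 2: pad with one space, unless this is the first text in the par,
           sits inside character-level markup (2a), or starts with punctuation (2b) -->
      <xsl:if test="$had-space
                    and $earlier[normalize-space(.) != '']
                    and not(ancestor::sub)
                    and not(contains('.,;:!?', substring($norm, 1, 1)))">
        <xsl:text> </xsl:text>
      </xsl:if>
      <xsl:value-of select="$norm"/>
    </xsl:if>
  </xsl:template>
</xsl:transform>
```

Spaces are only ever emitted in front of a text node, never after it, which avoids doubling a space when one node ends with whitespace and the next begins with it. Applied to the <par> example in this thread, the intent is a paragraph reading "Is his name really John Doe?". The weak point is exactly where you would expect: the <sub> rule and the punctuation list are guesses that a given document type or language may contradict.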


Naturally it would be nice if these heuristics could be generalized to the point where there could be a standard way of handling whitespace, e.g. in browsers; but I think you can see that 2b. is a very tall order (language dependent and not always consistent within a language) and 2a. is impossible in the general case without either (a) some kind of support from a schema or specification, to distinguish e.g. between "word-level" and "character-level" markup, or (b) extending xml:space with some monstrous semantics and using it all over the place.

At root, I think we see this problem as an expression of the Worlds in Collision represented by XML: on the data side, people are used to throwing in whitespace wherever, just to make the source code readable (which in principle is a good thing); whereas on the document side, white space has to be regarded as part of our source data since we simply have no way of knowing when it's not. In other words, whitespace is both, or either, data content, or "just markup" -- as it always has been.

But I'd be interested in what others have to say about this.

Cheers,
Wendell

Chad continued:
 For example, can anyone get this simple document into HTML without either
removing required spaces or adding inappropriate spaces?

  <?xml version="1.0"?>
  <book>
     <par>
      Is his name really <first>John</first>      <last>Doe</last>?
    </par>
  </book>

 Either you will end up with:
    "Is his name really JohnDoe?"
  which is wrong, or:
    "Is his name really John Doe ?"
  which is also wrong.

 Of course, this is a very simple example. In real-life situations bad
whitespace causes really nasty problems.  Of course, I'm pretty new to XSL
so maybe I just can't read the directions. Here's my XSL example:

 <?xml version="1.0" encoding="utf-8"?>
 <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>
  <xsl:preserve-space elements="*"/>
    <xsl:template match="/">
      <html><xsl:apply-templates/></html>
    </xsl:template>
 </xsl:transform>

Does anyone know of a work-around for this common problem?


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list



