Whitespace between (not inside) elements - how to get rid of

Post by **kellner** » Wed Nov 02, 2005 1:58 am

Hello,

I just started using Oxygen for a project where I tag lexical items in a text, for instance, in the following way:

<w lemma="chocolate">chocolate</w>-<w lemma="factory">factory</w>

Different XSL transformations will be applied to the XML file.
One prints the running text, ignoring the lemmas, while the other uses the lemma to construct an index.

Oxygen indents text by adding whitespace, and I'd like to keep it that way to make the XML file easier to read by, for instance, placing each instance of a <w>-tag on its own line.

However, I noticed that during the transformation to an HTML file, the whitespace *between* tags is maintained (i.e. contracted to one single space). This inserts unwanted spaces between the individual parts of compounds.

Is there a way to keep a nice Oxygen-ic layout to the source-file *and* to get rid of these whitespaces in the transformation to HTML?

Thanks,

Post by **george** » Wed Nov 02, 2005 10:52 am

Hi,

The oXygen Format and Indent action will not add spaces where there is no space between an element tag and the adjacent text. For instance, applying Format and Indent to:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<test>
    <w lemma="chocolate">chocolate</w>-<w lemma="factory">factory</w>
</test>

will leave the document as it is.

I assume you have in the XML file something like:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<test>
    <w lemma="chocolate">chocolate</w>
    -
    <w lemma="factory">factory</w>
</test>

and you do not want the new lines around "-". In that case you can process the document first with a stylesheet that will normalize the spaces on the text nodes inside the test element, like below:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="test/text()">
        <xsl:value-of select="normalize-space(.)"/>
    </xsl:template>
</xsl:stylesheet>

This gives the following result:

Code:

<?xml version="1.0" encoding="UTF-8"?><test><w lemma="chocolate">chocolate</w>-<w lemma="factory">factory</w></test>

In oXygen you can configure a transformation scenario to apply more than one XSLT stylesheet. Set this stylesheet as the main one, then add your stylesheet that converts to HTML using the Additional XSLT stylesheets action in the Edit scenario dialog.

Best Regards,
George

Post by **kellner** » Wed Nov 02, 2005 1:18 pm

Thank you. I hadn't thought this through up to the point where the whitespace between <w>-tags is of course whitespace *inside* the surrounding tag, in my case: <seg>.

I tried to apply the normalize-space function, but am not sure about the syntax.
Here's some sample XML-code:

Code:

<seg type="foot" n="b">
   <w lemma="saṃmarda">saṃmarda</w>
   <w lemma="saṃkṣobhita">saṃkṣobhita</w>
   <app>
      <rdg wit="J">saṃkṣobhita</rdg>
      <rdg wit="C">saṃśobhita</rdg>
   </app>
   <w lemma="kuṇḍalānām">kuṇḍalānām</w>
</seg>

seg is actually /TEI.2/text/body/div/lg/lg/seg

I don't want spaces between the individual w-tags (e.g. "saṃmardasaṃkṣobhita", not "saṃmarda saṃkṣobhita "). I don't want spaces between the w-tags and following app-tags, and I also don't want the presence of app-tags to interrupt the unspaced flow of w-content (hence: "saṃmardasaṃkṣobhita***kuṇḍalānām", where the asterisks are placeholders for some DHTML code that makes the content of app appear and disappear on clicking).

Could you kindly direct me towards how this translates into your code-example for normalizing space?

Thanks again,

Post by **george** » Wed Nov 02, 2005 1:55 pm

Hi,

For a document like:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<TEI.2>
    <text>
        <body>
            <div>
                <lg>
                    <lg>

                        <seg type="foot" n="b">
                            <w lemma="saṃmarda">saṃmarda</w>
                            <w lemma="saṃkṣobhita">saṃkṣobhita</w>
                            <app>
                                <rdg wit="J">saṃkṣobhita</rdg>
                                <rdg wit="C">saṃśobhita</rdg>
                            </app>
                            <w lemma="kuṇḍalānām">kuṇḍalānām</w>
                        </seg>

                    </lg>
                </lg>
            </div>
        </body>
    </text>
</TEI.2>

if you want to remove the spaces from both seg and app elements then you can use a stylesheet like:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="seg/text()|app/text()">
        <xsl:value-of select="normalize-space(.)"/>
    </xsl:template>
</xsl:stylesheet>

Best Regards,
George

Post by **kellner** » Wed Nov 02, 2005 2:34 pm

I copied this xsl-stylesheet and created a transformation scenario, with the normalization-xsl as the first stylesheet and the one that actually transforms to html as an additional one.

However, the effect is not as desired: sequences of newline plus spaces/tabs are contracted into one single space.

This seems to me to be in accordance with the specification of normalize-space:

"The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space."

http://www.w3.org/TR/xpath

What I want, however, is to get rid of certain spaces, not to replace them by a single one. So it seems to me that normalize-space would not be able to achieve the desired goal in the first place. Or have I misunderstood its function?

Thanks, and best regards,
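
As a side note on the quoted definition, here is a minimal XSLT 1.0 sketch of what normalize-space() returns for the two kinds of string involved. It is illustrative only: the element names demo, ws-only and mixed are made up, and the stylesheet ignores its input document and works on literal strings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="no"/>
    <!-- Illustrative only: can be run against any input document; the input is ignored. -->
    <xsl:template match="/">
        <demo>
            <!-- Whitespace-only string (a newline plus indentation): all of it counts as
                 leading/trailing whitespace, so the result is the empty string. -->
            <ws-only>[<xsl:value-of select="normalize-space('&#10;        ')"/>]</ws-only>
            <!-- String that also contains words: outer whitespace is stripped and each
                 inner run of whitespace is replaced by a single space. -->
            <mixed>[<xsl:value-of select="normalize-space('  saṃmarda &#10;   saṃkṣobhita  ')"/>]</mixed>
        </demo>
    </xsl:template>
</xsl:stylesheet>
```

With a conforming processor the ws-only element should contain "[]" and the mixed element "[saṃmarda saṃkṣobhita]". In other words, a single space survives only where a text node contains other characters besides whitespace; a text node that consists purely of line breaks and indentation is reduced to nothing.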

Post by **george** » Wed Nov 02, 2005 5:22 pm

Hi,

Also add a translate() call that maps spaces to nothing, as below:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="seg/text()|app/text()">
        <xsl:value-of select="translate(normalize-space(.), ' ', '')"/>
    </xsl:template>
</xsl:stylesheet>

Best Regards,
George
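
A possible alternative, sketched under the assumption that the unwanted whitespace consists only of whitespace-only text nodes (line breaks and indentation) directly inside seg and app: XSLT 1.0's xsl:strip-space removes such nodes from the source tree before any template runs. Unlike translate(..., ' ', ''), which deletes every space in the matched text nodes, it leaves spaces inside text that also contains other characters untouched. The indent="no" setting is an added precaution, not something from the posts above, so that the serializer does not insert whitespace of its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <!-- Drop whitespace-only text nodes that are children of seg or app. -->
    <xsl:strip-space elements="seg app"/>
    <!-- Ask the serializer not to add indentation of its own. -->
    <xsl:output method="xml" indent="no"/>
    <!-- Identity template: copy everything else unchanged. -->
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

Whether this fits depends on the data: a space that belongs to the text inside a w or rdg element is kept by this variant, so it would still need separate handling if it is unwanted.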

Post by **kellner** » Wed Nov 02, 2005 7:07 pm

This gets confusing. I have disabled the xml-to-html-transformation and am running the xml file just through the normalisation stylesheet.

This is the code before:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [

<!ENTITY % TEI.linking 'INCLUDE'>
<!ENTITY % TEI.figures 'INCLUDE'>
<!ENTITY % TEI.analysis 'INCLUDE'>
<!ENTITY % TEI.XML 'INCLUDE'>
<!ENTITY % TEI.textcrit 'INCLUDE'>
<!ENTITY % TEI.verse 'INCLUDE'>

]>
<TEI.2>
	<teiHeader>
		<fileDesc>
			<titleStmt>
				<title> Aśvaghoṣas Buddhacarita: der 3. Sarga (saṃvegotpatti) </title>
			</titleStmt>
			<publicationStmt>
				<distributor> Birgit Kellner </distributor>
			</publicationStmt>
			<sourceDesc>
				<bibl>
					<author> Aśvaghoṣa </author>
					<editor> E.H. Johnston </editor>
					<title> The Buddhacarita: Or, Acts of the Buddha. Part I: Sanskrit Text </title>
					<pubPlace> Calcutta </pubPlace>
					<publisher> Baptist Mission Press </publisher>
					<date> 1935 </date>
				</bibl>
			</sourceDesc>
		</fileDesc>
	</teiHeader>
	<text>
		<front></front>
		<body>
			<div>
				<head>Aśvaghoṣas Buddhacarita: Der 3. Sarga (saṃvegotpatti) </head>
				<lg n="1">
					<l>
						<seg type="foot" n="a">
							<w lemma="tatas"> tataḥ </w>
							<w lemma="kadācin">kadācin </w>
							<w lemma="mṛdu"> mṛdu</w><w lemma="śādvala">śādvalāni </w>
						</seg>
						<seg type="foot" n="b">
							<w lemma="pumān"> puṃs</w><w lemma="kokila">kokil</w><w lemma="unnādita">onnādita</w><w lemma="pādapa">pādapāni</w>
						</seg>
					</l>
					<l>
						<seg type="foot" n="c">
							<w lemma="śru-">śuśrāva</w>
							<w lemma="padmākara">padmākara</w><w lemma="maṇḍita">maṇḍitāni</w>
						</seg>
						<seg type="foot" n="d">
							<w lemma="gīta">gītair </w>
							<app>
								<rdg wit="J"> gītair </rdg>
								<rdg wit="C">śīte </rdg>
							</app>
							<w lemma="nibaddha">nibaddhāni </w>
							<w lemma="sa">sa </w>
							<w lemma="kānana">kānanāni </w>
						</seg>
					</l>
				</lg>
				</div></body></text></TEI.2>

This is the code after the stylesheet:

Code:

<?xml version="1.0" encoding="utf-8"?>
<TEI.2 TEIform="TEI.2">
	
   <teiHeader type="text" status="new" TEIform="teiHeader">
		
      <fileDesc TEIform="fileDesc">
			
         <titleStmt TEIform="titleStmt">
				
            <title TEIform="title"> Aśvaghoṣas Buddhacarita: der 3. Sarga (saṃvegotpatti) </title>
			
         </titleStmt>
			
         <publicationStmt TEIform="publicationStmt">
				
            <distributor TEIform="distributor"> Birgit Kellner </distributor>
			
         </publicationStmt>
			
         <sourceDesc default="NO" TEIform="sourceDesc">
				
            <bibl default="NO" TEIform="bibl">
					
               <author TEIform="author"> Aśvaghoṣa </author>
					
               <editor role="editor" TEIform="editor"> E.H. Johnston </editor>
					
               <title TEIform="title"> The Buddhacarita: Or, Acts of the Buddha. Part I: Sanskrit Text </title>
					
               <pubPlace TEIform="pubPlace"> Calcutta </pubPlace>
					
               <publisher TEIform="publisher"> Baptist Mission Press </publisher>
					
               <date TEIform="date"> 1935 </date>
				
            </bibl>
			
         </sourceDesc>
		
      </fileDesc>
	
   </teiHeader>
	
   <text TEIform="text">
		
      <front TEIform="front"/>
		
      <body TEIform="body">
			
         <div org="uniform" sample="complete" part="N" TEIform="div">
				
            <head TEIform="head">Aśvaghoṣas Buddhacarita: Der 3. Sarga (saṃvegotpatti) </head>
				
            <lg n="1" org="uniform" sample="complete" part="N" TEIform="lg">
					
               <l part="N" TEIform="l">
						
                  <seg type="foot" n="a" part="N" TEIform="seg">
                     <w lemma="tatas" part="N" TEIform="w"> tataḥ </w>
                     <w lemma="kadācin" part="N" TEIform="w">kadācin </w>
                     <w lemma="mṛdu" part="N" TEIform="w"> mṛdu</w>
                     <w lemma="śādvala" part="N" TEIform="w">śādvalāni </w>
                  </seg>
						
                  <seg type="foot" n="b" part="N" TEIform="seg">
                     <w lemma="pumān" part="N" TEIform="w"> puṃs</w>
                     <w lemma="kokila" part="N" TEIform="w">kokil</w>
                     <w lemma="unnādita" part="N" TEIform="w">onnādita</w>
                     <w lemma="pādapa" part="N" TEIform="w">pādapāni</w>
                  </seg>
					
               </l>
					
               <l part="N" TEIform="l">
						
                  <seg type="foot" n="c" part="N" TEIform="seg">
                     <w lemma="śru-" part="N" TEIform="w">śuśrāva</w>
                     <w lemma="padmākara" part="N" TEIform="w">padmākara</w>
                     <w lemma="maṇḍita" part="N" TEIform="w">maṇḍitāni</w>
                  </seg>
						
                  <seg type="foot" n="d" part="N" TEIform="seg">
                     <w lemma="gīta" part="N" TEIform="w">gītair </w>
                     <app TEIform="app">
                        <rdg wit="J" TEIform="rdg"> gītair </rdg>
                        <rdg wit="C" TEIform="rdg">śīte </rdg>
                     </app>
                     <w lemma="nibaddha" part="N" TEIform="w">nibaddhāni </w>
                     <w lemma="sa" part="N" TEIform="w">sa </w>
                     <w lemma="kānana" part="N" TEIform="w">kānanāni </w>
                  </seg>
					
               </l>
				
            </lg>
				
         </div>
      </body>
   </text>
</TEI.2>

And this is the stylesheet:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
	<xsl:template match="node() | @*">
		<xsl:copy>
			<xsl:apply-templates select="node() | @*"/>
		</xsl:copy>
	</xsl:template>
	<xsl:template match="seg/text() |app/text()">
		<xsl:value-of select="translate(normalize-space(.), ' ', '')"/>
	</xsl:template>
</xsl:stylesheet>

I have pasted the result code from the Oxygen window. Does Oxygen prettify it again, so that the removal of spaces through the xsl gets undone again?

If I apply the xml-to-html transformation, spaces appear even where there didn't appear any before, so I assume that the normalisation xsl does produce xml with more newlines than before, for whatever reason.

Post by **george** » Wed Nov 02, 2005 9:12 pm

Hmmm... we will look into this. It seems that this problem appears only with Saxon, so try using Xalan as the XSLT processor for the transformation scenario.

Best Regards,
George

Post by **kellner** » Wed Nov 02, 2005 10:40 pm

Ah yes - with Xalan, the code you suggested works perfectly.

Many thanks for your patience!

Best regards,