
Re: [xsl] Dealing mixed content with invalid node-like text


Subject: Re: [xsl] Dealing mixed content with invalid node-like text
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Mon, 05 Dec 2011 01:52:49 +0100

Karl, have you tried the TagSoup parser (http://ccil.org/~cowan/XML/tagsoup/)?

You can use it either standalone or with Saxon-PE/EE and saxon:parse-html().

It will parse the pseudo-tags as something XML-compliant (maybe at the cost of discarding them, though). But if it keeps them in a recognizable way, it will make for a useful preprocessor.

Gerrit

On 2011-12-05 01:15, Michael Kay wrote:
If you need to read a file in a format that is not XML, then in general
I would suggest you start by defining a BNF grammar for the language you
want to accept, and then write a parser for that grammar using the usual
parsing techniques (top-down or bottom-up) taught in every computer
science course. If the language is similar to XML, then it is too
complex to parse using regular expressions.

Michael Kay
Saxonica
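
As a rough illustration of the grammar-then-parser approach suggested above, here is a minimal sketch in Python (not XSLT, and not code from anyone in this thread). It assumes a very small grammar: a tag is <name>, </name> or <name/> with no attributes, and anything else, including tag-like text such as <II .>, is plain text. The names TAG, tokenize and to_wellformed are hypothetical, and the "nodename" wrapper is taken from Karl's example output. A stack tracks open tags so that unbalanced tags can be escaped rather than emitted.

```python
import re
from xml.sax.saxutils import escape

# Assumed grammar: <name>, </name> or <name/>, no attributes.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")


def tokenize(text):
    """Yield (kind, value) pairs; kind is 'open', 'close', 'empty' or 'text'."""
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        slash_before, name, slash_after = m.groups()
        if slash_before and slash_after:
            yield ("text", m.group(0))   # "</x/>" is not a tag
        elif slash_before:
            yield ("close", name)
        elif slash_after:
            yield ("empty", name)
        else:
            yield ("open", name)
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])


def to_wellformed(text, wrapper="nodename"):
    """Return text wrapped in <wrapper>, keeping balanced tags, escaping the rest."""
    out = []     # output fragments
    stack = []   # (name, index into out) for tags opened but not yet closed
    for kind, value in tokenize(text):
        if kind == "text":
            out.append(escape(value))
        elif kind == "empty":
            out.append("<%s/>" % value)
        elif kind == "open":
            stack.append((value, len(out)))
            out.append("<%s>" % value)
        elif value in [name for name, _ in stack]:
            # Matching open tag exists: escape unclosed tags opened after it.
            while True:
                name, idx = stack.pop()
                if name == value:
                    break
                out[idx] = escape(out[idx])
            out.append("</%s>" % value)
        else:
            out.append(escape("</%s>" % value))   # stray close tag
    for _, idx in stack:
        out[idx] = escape(out[idx])               # never closed
    return "<%s>%s</%s>" % (wrapper, "".join(out), wrapper)


if __name__ == "__main__":
    print(to_wellformed("Sometext<xx>within valid node</xx> and like<II .> Title etc"))
```

The resulting string is intended to be well-formed, so it could then be handed to an ordinary XML parser. Tags with attributes, comments, CDATA sections and entity references are outside this toy grammar and would need extra rules.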

On 04/12/2011 19:15, Karlmarx R wrote:

Hello,


I have a situation where I need to deal with mixed-content text that
also contains text within angle brackets, and convert it to XML output. For
example, texts like:

"Sometext<xx>within valid node</xx> and like<II .> Title etc"
"Sometext like<1a .> Title etc,<xx>within<b>something</b> valid
node</xx> etc".

Now, the output has to be like:

<nodename>Sometext<xx>within valid node</xx> and like&lt;II .&gt;
Title etc</nodename>
<nodename>Sometext like&lt;1a .&gt; Title
etc,<xx>within<b>something</b> valid node</xx> etc</nodename>

At present I do not get things like <br/>, but if I did, it being
valid, I should treat it as a node. The point I am trying to make is that
non-node things like <II .> and <1a .> need to have their angle brackets
escaped to make the XML valid. Currently I use analyze-string
with a regex to deal with this, which does not work correctly (due to
mistakes). But I would like to know whether there is a good standard
solution for dealing with this sort of text. At present each line of text
is passed to this template and treated like:

<xsl:template name="tag-text">
  <xsl:param name="unparsed" required="yes"/>
  <!-- this regex has flaws, in that it fails to treat those invalid nodes -->
  <xsl:analyze-string select="$unparsed"
      regex="^(.*?)&lt;(.+)&gt;(.*)&lt;/(.+)&gt;(.*?)$">
    <xsl:matching-substring>
      <!-- process the groups and, if necessary, recursively call this template again -->
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

I suspect there could be a better regex to get the result I want, but
I am not sure whether XSLT itself has a better way to deal with this.
Could you please suggest possible solutions (including a better regex,
if any of you have used one successfully)?

Thanks in advance,
Karl


--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler

