
Subject: Re: html to xml
From: tra@xxxxxxxxxxxxxxx (Thorbjoern Ravn Andersen)
Date: Fri, 27 Oct 2000 14:17:07 +0200

* Sebastian Rahtz <sebastian.rahtz@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> [Oct 27. 2000 12:58]:
> Lisa van Gelder writes:
>  > The basic problem is that the html you are getting is not structured enough
>  > for your purposes.
>  > 
>  > I had the same problem, and solved it by setting rules for how the html
>  > could be structured, so it could be converted into xml more easily. I do not
>  > allow any text that is not surrounded by tags.
> 
> I was afraid someone would say that. My problem is that the task is to
> convert our existing web pages (6196 documents, at last count) to (TEI DTD)
> XML. So I have no control over the original coding. So the conclusion
> is, I guess, "clean up the HTML minimally even before running tidy".

Could you introduce an XSLT step saying that every text() node
whose immediate parent is an h1..h6 element should be enclosed
in <p> tags?
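
Something along these lines might do as a starting point. It is only a
sketch under a few assumptions of mine: XSLT 1.0, input that is already
well-formed XML (for instance after tidy), and lowercase heading elements
in no namespace. If the input carries the XHTML namespace, the parent::
tests would need a prefix bound to that namespace. Whitespace-only text
nodes are left alone so that no empty <p> elements are produced.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged
       unless a more specific template below applies. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Non-blank text whose immediate parent is h1..h6:
       wrap it in a <p> element. Assumes un-namespaced,
       lowercase element names. -->
  <xsl:template match="text()[normalize-space()]
                             [parent::h1 or parent::h2 or parent::h3 or
                              parent::h4 or parent::h5 or parent::h6]">
    <p><xsl:value-of select="."/></p>
  </xsl:template>

</xsl:stylesheet>
```

With this, <h2>Some title</h2> would come out as
<h2><p>Some title</p></h2>, while everything else is copied through
unchanged. Text inside inline children of a heading (an <a> or <b>, say)
is not touched, because its immediate parent is not the heading itself.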

-- 
  Thorbjørn Ravn Andersen                   "...sound of... Tubular Bells!"
  http://bigfoot.com/~thunderbear


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


