XPath on a 'broken' HTML page.

William
Posts: 42
Joined: Sun Jul 15, 2012 12:26 pm
Location: London

Post by William »

Hello,

I'm trying to scrape data off a web page that I have no control over, and for testing I've saved the page to the local file system. When I run the transform I get various errors caused by the errors within the page itself. Is there any way I can bypass these to get to the data I want?

If I correct the errors in the saved page then my transform works without error, but I intend to use Java to automate this process, so that really isn't an option.

Does anyone have any ideas? I can think of a solution in Java: read the HTTP response, then extract the <body> element content into an empty shell such as <html><head/>put extracted body content here</html>. However, I'd like to test my XPath etc. using the IDE first.
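
For reference, the wrapping idea could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the page text is assumed to be in a String already (no HTTP code is shown), and a <body> wrapper is added around the extracted content. It does not repair malformed markup inside the body, so the result may still fail to parse as XML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls out the content of <body> and puts it in a minimal shell.
class BodyWrapSketch {

    // Case-insensitive match of <body ...> up to the last </body>, across line breaks.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*)</body\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String wrapBody(String html) {
        Matcher m = BODY.matcher(html);
        if (!m.find()) {
            throw new IllegalArgumentException("no <body> element found");
        }
        // The extracted content is copied as-is; nothing inside it is fixed.
        return "<html><head/><body>" + m.group(1) + "</body></html>";
    }

    public static void main(String[] args) {
        String page = "<HTML><BODY class=\"x\"><p>Hi</p></BODY></HTML>";
        System.out.println(wrapBody(page));
    }
}
```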

P.S. Sorry, but I'm not willing to say what the site in question is.
--
William
george
Site Admin
Posts: 2095
Joined: Thu Jan 09, 2003 2:58 pm

Re: XPath on a 'broken' HTML page.

Post by george »

You need to pass the HTML through a tool like NekoHTML, which gets you well-formed XML (XHTML) from HTML. Then you can use all the normal XML tools on that: XPath, XSLT, XQuery, etc.
In oXygen you can use File -> Import -> HTML File to get XHTML out of HTML; the import is based on NekoHTML.
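
Once the page has been converted to well-formed XHTML, the XPath step can also be tried from plain Java with the JDK's built-in javax.xml APIs. The following is only a minimal sketch: the class name, the sample markup, and the expression are made up for illustration, and it assumes the XHTML carries no namespace declaration or DOCTYPE (a namespaced document would need a NamespaceContext on the XPath object). The HTML-to-XHTML conversion itself is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class: evaluates an XPath expression against well-formed XHTML text.
class XhtmlXPathSketch {

    // Parses the XHTML into a DOM and returns the string value of the first match.
    static String evaluate(String xhtml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XHTML produced by the conversion step.
        String xhtml = "<html><head><title>Example</title></head>"
                + "<body><p class=\"price\">9.99</p></body></html>";
        System.out.println(evaluate(xhtml, "//p[@class='price']")); // prints 9.99
    }
}
```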

Best Regards,
George
George Cristian Bina
William
Posts: 42
Joined: Sun Jul 15, 2012 12:26 pm
Location: London

Re: XPath on a 'broken' HTML page.

Post by William »

Hello George.

Read the bumph and it sounds just what I need, thank you for your time.

--
William