XPath on a 'broken' HTML page.

Post by **William** » Sat Aug 04, 2012 2:32 pm

Hello,

I'm trying to scrape data off a web page that I have no control over, and for testing I've saved the page to the local file system. When I execute the transform I get various errors because the page itself is not well-formed. Is there any way I can bypass this to get to the data I want?

If I correct the errors in the page I've saved, my transform works without error, but I was intending to use Java to automate this process, so fixing the page by hand isn't really an option.

Does anyone have any ideas? I can think of a solution using Java: read the HTTP response, then extract the <body> element content into an empty <html><head/>put extracted body content here</html> skeleton. But I'd like to test my XPath etc. using the IDE first.
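For reference, a rough sketch of that Java workaround, using only the JDK's built-in XML and XPath classes. The class name `BodyXPathSketch`, the methods `wrapBody` and `evaluate`, and the sample page are made up for illustration. It assumes the markup errors are all outside the <body> element and that the body content is itself well-formed XML; if the body has unclosed tags or HTML-only entities such as `&nbsp;`, the parse will still fail.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: none of these names come from the thread.
class BodyXPathSketch {

    // Matches the first <body ...>...</body> element, ignoring case, across line breaks.
    private static final Pattern BODY = Pattern.compile(
            "<body\\b[^>]*>(.*?)</body\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Drops everything outside <body> and wraps the body content in a clean skeleton.
    static String wrapBody(String rawHtml) {
        Matcher m = BODY.matcher(rawHtml);
        if (!m.find()) {
            throw new IllegalArgumentException("No <body> element found");
        }
        return "<html><head/>" + m.group(1) + "</html>";
    }

    // Parses the wrapped page as XML and returns the string value of the XPath result.
    static String evaluate(String rawHtml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wrapBody(rawHtml))));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: the <head> is malformed (unquoted attribute, unclosed tags),
        // but the body content happens to be well-formed.
        String page = "<html><head><meta charset=utf-8><title>Broken</head>"
                + "<body><div id=\"price\">42</div></body></html>";
        System.out.println(evaluate(page, "//div[@id='price']"));
    }
}
```

The regular expression is only there to cut out the body; it does not repair anything, so this approach is limited to pages whose errors sit in the head or outside the root element.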

P.S. Sorry, but I'm not willing to say what the site in question is.
--
William

Post by **george** » Sat Aug 04, 2012 3:26 pm

You need to pass the HTML through a tool like NekoHTML, which gives you well-formed XML (XHTML) from HTML. Then you can use all the normal XML tools on that: XPath, XSLT, XQuery, etc.
In oXygen you can use File -> Import -> HTML File to get XHTML out of HTML based on NekoHTML.

Best Regards,
George

Post by **William** » Sat Aug 04, 2012 3:47 pm

Hello George.

Read the bumph and it sounds just what I need, thank you for your time.

--
William
