XPath on a 'broken' HTML page.
Posted: Sat Aug 04, 2012 2:32 pm
Hello,
I'm trying to scrape data off a web page that I have no control over, and for testing I've saved the page to the local file system. When I execute the transform I get various errors because the page itself is not well-formed. Is there any way I can bypass these errors to get to the data I want?
If I correct the errors in the saved page, my transform works without error, but I intend to use Java to automate this process, so fixing the page by hand isn't really an option.
Does anyone have any ideas? I can think of a solution in Java: read the HTTP response, extract the content of the <body> element, and place it into an otherwise empty document of the form <html><head/>put extracted body content here</html>. However, I'd like to test my XPath etc. using the IDE first.
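The Java idea above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution: the class name BodyExtractor, the helper names wrapBody and evaluate, and the sample page are all made up for the example, and it assumes the errors are confined to the <head> so that the extracted body content is itself well-formed XML. The sketch also wraps the content in a <body> element, which the description above does not spell out.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
class BodyExtractor {
    // Matches everything between <body ...> and </body>,
    // case-insensitively and across line breaks.
    private static final Pattern BODY = Pattern.compile(
        "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Pulls out the body content and places it in a minimal, clean wrapper.
    static String wrapBody(String rawHtml) {
        Matcher m = BODY.matcher(rawHtml);
        if (!m.find()) {
            throw new IllegalArgumentException("no <body> element found");
        }
        return "<html><head/><body>" + m.group(1) + "</body></html>";
    }

    // Parses the wrapped document as XML and returns the string value
    // of the XPath expression. Assumption: the body content is well-formed.
    static String evaluate(String rawHtml, String xpathExpr) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new InputSource(new StringReader(wrapBody(rawHtml))));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(xpathExpr, doc);
    }

    public static void main(String[] args) throws Exception {
        // Made-up page: the <head> is not well-formed, the <body> is.
        String page = "<html><head><meta charset=utf-8><title>t</head>"
            + "<body><div id=\"price\">42</div></body></html>";
        System.out.println(evaluate(page, "//div[@id='price']"));
    }
}
```

If the body content is also malformed (unclosed tags, unquoted attributes), the XML parser will still reject it, and the page would need to be cleaned up by an HTML-tolerant parser before XPath can be applied.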
P.S. Sorry, but I'm not willing to say what the site in question is.
--
William