XPath on a 'broken' HTML page.
- Posts: 42
- Joined: Sun Jul 15, 2012 12:26 pm
- Location: London
Hello,
I'm trying to scrape data from a web page that I have no control over, and during testing I've saved the page to the local file system. When I execute the transform I get various errors caused by the malformed markup in the page. Is there any way I can bypass these errors to get to the data I want?
If I correct the errors in the page I've saved, my transform works without error, but I intend to use Java to automate this process, so fixing pages by hand isn't really an option.
Does anyone have any ideas? I can think of a solution using Java: read the HTTP response, then extract the <body> element content into an empty <html><head/>put extracted body content here</html> shell. But I'd like to test my XPath etc. using the IDE first.
P.S. Sorry, but I'm not willing to say what the site in question is.
--
William
- Site Admin
- Posts: 2095
- Joined: Thu Jan 09, 2003 2:58 pm
Re: XPath on a 'broken' HTML page.
You need to pass the HTML through a tool like NekoHTML, which gives you well-formed XML (XHTML) from HTML. Then you can use all the normal XML tools on that: XPath, XSLT, XQuery, etc.
In oXygen you can use File -> Import -> HTML File to get XHTML out of HTML based on NekoHTML.
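Once the page has been converted to well-formed XHTML, the XPath step can be automated from Java with the standard JAXP classes. What follows is only a minimal sketch: the class name XhtmlXPathSketch, the helper evaluate, and the sample markup are made up for illustration, and the HTML-to-XHTML conversion itself (for example with NekoHTML) is not shown. It also assumes the cleaned-up document carries no XML namespace. If your converter emits the XHTML namespace or upper-case element names, the XPath expression would need adjusting.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of oXygen or NekoHTML.
class XhtmlXPathSketch {

    // Parses an already well-formed XHTML string and returns the
    // string value of the first node matched by the XPath expression.
    static String evaluate(String xhtml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the output of the HTML clean-up step (assumed input).
        String xhtml =
            "<html><head><title>Sample</title></head>"
            + "<body><table><tr><td class=\"price\">42</td></tr></table></body></html>";

        System.out.println(evaluate(xhtml, "//td[@class='price']"));
    }
}
```

The same expression can be tried first in the IDE against the imported XHTML file and then moved into the Java code unchanged, provided both see the same element names.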
Best Regards,
George
George Cristian Bina