Turn off DTD validation for HTML 4.01 loose.dtd?
Posted: Tue Oct 22, 2013 4:02 pm
Hi all,
I'm processing a set of hOCR files, and Saxon-EE has issues with the DTD they reference, I think because it is an SGML DTD rather than an XML one (yes? I'm not sure). The files were generated with the following <!DOCTYPE>:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
I can strip the <!DOCTYPE> line from the HTML, but I was curious why turning off the 'DTD validation of the source' option (Options > Preferences > XSLT-FO-XQUERY > XSLT > Saxon > Saxon-HE/PE/EE) doesn't seem to have an effect.
Is this just how it is (Saxon recognizes the document as SGML and looks for the DTD), or is there a work-around that doesn't involve a preprocessing step?
Thanks!
PS I'm converting these hOCR files to XML and the error I'm getting is:
System ID: http://www.w3.org/TR/html4/loose.dtd
Severity: fatal
Description: The declaration for the entity "HTML.Version" must end with '>'.
Start location: 31:3
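
One relevant detail: an XML parser may still read the external DTD when validation is off, since it needs it for entity declarations and attribute defaults. loose.dtd is written in SGML syntax, which would explain the fatal error on the "HTML.Version" entity. The following is only a minimal sketch of one possible workaround outside oXygen, assuming the input is read through a plain JAXP/SAX parser: an entity resolver that answers every external lookup with an empty document, so loose.dtd is never fetched or parsed. The class and method names (`SkipDtdDemo`, `rootElementWithoutDtd`) are hypothetical, and this has not been tested against the oXygen/Saxon setup described above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: parses markup that carries a DOCTYPE without
// ever loading the DTD that the DOCTYPE points to.
class SkipDtdDemo {

    // Returns the name of the root element, parsing with validation off and
    // with every external entity (including the DTD) resolved to empty content.
    static String rootElementWithoutDtd(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false); // on its own this does not stop the DTD from being read
            XMLReader reader = factory.newSAXParser().getXMLReader();

            // Substitute an empty stream for any external DTD or entity.
            reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

            final String[] root = new String[1];
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = qName; // remember only the first element seen
                    }
                }
            });

            reader.parse(new InputSource(new StringReader(markup)));
            return root[0];
        } catch (Exception e) {
            throw new RuntimeException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        String doc =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
            + "\"http://www.w3.org/TR/html4/loose.dtd\">\n"
            + "<html><head><title>t</title></head><body><p>x</p></body></html>";
        System.out.println(rootElementWithoutDtd(doc));
    }
}
```

One caveat with this approach: with the DTD replaced by empty content, named HTML entities such as `&nbsp;` are no longer declared, so files that use them would still fail to parse. The markup also has to be well-formed XML for any of this to work.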