Turn off DTD validation for HTML 4.01 loose.dtd?

Posted: Tue Oct 22, 2013 4:02 pm
by bds
Hi all,

I'm processing a set of hOCR files that were generated with the following <!DOCTYPE>:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
Saxon-EE has issues with this DTD, apparently because it is an SGML DTD (yes? I'm not sure).

I can strip the <!DOCTYPE> line from the HTML, but I was curious why turning off the 'DTD validation of the source' option (Options > Preferences > XSLT-FO-XQUERY > XSLT > Saxon > Saxon-HE/PE/EE) doesn't seem to have an effect.

Is this just how it is (Saxon recognizes the document as SGML and looks for the DTD), or is there a work-around that doesn't involve a preprocessing step?

Thanks!

PS I'm converting these hOCR files to XML and the error I'm getting is:

System ID: http://www.w3.org/TR/html4/loose.dtd
Severity: fatal
Description: The declaration for the entity "HTML.Version" must end with '>'.
Start location: 31:3

Re: Turn off DTD validation for HTML 4.01 loose.dtd?

Posted: Tue Oct 22, 2013 4:46 pm
by adrian
Hi,

Yes, that DTD seems to be an SGML DTD.

The 'DTD validation of the source' setting (disabled by default) refers strictly to an optional validation of the source XML. It does not affect XML parsing (building the XML model), which always uses the DTD specified in the DOCTYPE, since for XML the DTD is an integral part of the model. The Saxon transformation is then applied to that model.
So this isn't actually related to Saxon, but to the XML parser (Xerces) that Oxygen configures and uses. Oxygen does not provide a setting for completely bypassing the DOCTYPE during XML parsing.

One possible workaround is to create an XML catalog that resolves the PUBLIC ID and/or SYSTEM ID to a dummy DTD (or even to the real XHTML 1.0 Transitional DTD).
catalog.xml

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//W3C//DTD HTML 4.01 Transitional//EN" uri="dummy.dtd"/>
<system systemId="http://www.w3.org/TR/html4/loose.dtd" uri="dummy.dtd"/>
</catalog>
dummy.dtd

<!ELEMENT html ANY>
Place these two files (catalog and DTD) in the same folder and configure the XML catalog in Options > Preferences > XML > XML Catalog.
You will still get validation errors because of the limited dummy.dtd, but this will allow you to use the HTML as the input of a transformation.
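
To illustrate what the catalog does, here is a rough Python sketch of the lookup, using only the standard library. It is not how Oxygen or Xerces implement catalog resolution; the function names `load_catalog` and `resolve` are made up for this example, and the matching order (system ID first, then public ID) is a simplification of the OASIS XML Catalogs rules.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Same entries as the catalog.xml shown above.
CATALOG_XML = """\
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD HTML 4.01 Transitional//EN" uri="dummy.dtd"/>
  <system systemId="http://www.w3.org/TR/html4/loose.dtd" uri="dummy.dtd"/>
</catalog>
"""


def load_catalog(text):
    """Return (system_map, public_map) built from the catalog entries."""
    root = ET.fromstring(text)
    system_map = {e.get("systemId"): e.get("uri")
                  for e in root.iter("{%s}system" % CATALOG_NS)}
    public_map = {e.get("publicId"): e.get("uri")
                  for e in root.iter("{%s}public" % CATALOG_NS)}
    return system_map, public_map


def resolve(public_id, system_id, system_map, public_map):
    """Simplified lookup: try the system ID, then the public ID."""
    if system_id in system_map:
        return system_map[system_id]
    if public_id in public_map:
        return public_map[public_id]
    # No catalog entry: a parser would go to the original location.
    return system_id


system_map, public_map = load_catalog(CATALOG_XML)
resolved = resolve("-//W3C//DTD HTML 4.01 Transitional//EN",
                   "http://www.w3.org/TR/html4/loose.dtd",
                   system_map, public_map)
print(resolved)
```

With both entries in place, the DOCTYPE from the hOCR files maps to dummy.dtd whether the parser matches on the PUBLIC ID or the SYSTEM ID, so the SGML DTD at w3.org is never read.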

Note that if the HTML is not well-formed XML, this won't help, and you're better off importing the HTML with File > Import > HTML File....

Regards,
Adrian