Turn off DTD validation for HTML 4.01 loose.dtd?


Tue Oct 22, 2013 4:02 pm

Hi all,

I'm processing a set of hOCR files that were generated with the following <!DOCTYPE>:

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

Saxon-EE has issues with the DTD, apparently because it is an SGML DTD (yes? I'm not sure).

I can strip the <!DOCTYPE> line from the HTML, but I was curious why turning off the 'DTD validation of the source' option (Options > Preferences > XSLT-FO-XQUERY > XSLT > Saxon > Saxon-HE/PE/EE) doesn't seem to have an effect.

Is this just how it is (Saxon recognizes the document as SGML and looks for the DTD), or is there a work-around that doesn't involve a preprocessing step?


PS: I'm converting these hOCR files to XML, and the error I'm getting is:

Code:

System ID: http://www.w3.org/TR/html4/loose.dtd
Severity: fatal
Description: The declaration for the entity "HTML.Version" must end with '>'.
Start location: 31:3

Re: Turn off DTD validation for HTML 4.01 loose.dtd?

Tue Oct 22, 2013 4:46 pm


Yes, that DTD seems to be of the SGML flavor, which an XML parser cannot read.

The setting 'DTD validation of the source' (disabled by default) refers strictly to an optional validation of the source XML. It does not affect XML parsing (building the XML model), which always uses the DTD specified in the DOCTYPE, since for XML the DTD is an integral part of the model. The Saxon transformation is then applied to the XML model.
So this doesn't actually have to do with Saxon, but with the XML parser (Xerces) that Oxygen configures and uses. Oxygen does not provide a setting for completely bypassing the DOCTYPE during XML parsing.

One possible workaround is to create an XML catalog that resolves the PUBLIC ID and/or SYSTEM ID to a dummy DTD (or even the real XHTML 1.0 Transitional DTD).

Code:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD HTML 4.01 Transitional//EN" uri="dummy.dtd"/>
  <system systemId="http://www.w3.org/TR/html4/loose.dtd" uri="dummy.dtd"/>
</catalog>

Place these two files (the catalog and dummy.dtd) in the same folder and add the catalog in Options > Preferences > XML > XML Catalog.
You will still get validation errors because the dummy.dtd is so limited, but this will allow you to use the HTML as the input of a transformation.
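
The contents of dummy.dtd are not shown in this thread. A minimal sketch of what such a placeholder might look like, assuming the hOCR files use html as the root element and at most the nbsp entity (both are assumptions, not taken from the original files):

```dtd
<!-- dummy.dtd: hypothetical placeholder, not the original file from this thread -->
<!-- Only the root element is declared. Everything else stays undeclared, -->
<!-- so validation still reports errors, but the parser can read the DTD. -->
<!ELEMENT html ANY>
<!-- Assumed entity; declare others only if the input actually uses them. -->
<!ENTITY nbsp "&#160;">
```

The point of the file is only to give the XML parser a DTD written in XML syntax in place of the SGML one, so it can stay this small. Any named entity the input references does need a declaration here, otherwise parsing fails on the undeclared entity.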

Note that if the HTML is not well-formed XML, this won't help, and you're better off importing the HTML with File > Import > HTML File....

Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger