[oXygen-user] Xerces command line parsing?

Andrew Rouner
Tue Oct 17 08:50:22 CDT 2006


Hello,

I am looking for the right syntax and method to be able to batch-parse XML
files from the command line using Xerces.  I need to use Xerces as I am
attempting to replicate parsing using oXygen (which has Xerces as its
default parser).  If anyone can send along the syntax for doing this or can
point me to a resource that can help, I'd very much appreciate it.

I previously used xmllint/LIBXML to do command line parsing of my TEI files,
which worked well for files calling on the TEI xlite DTD.  I am now dealing
with files that use the full TEI and must rely on the xml catalog, i.e.:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [
<!ENTITY % TEI.XML 'INCLUDE'>
<!ENTITY % TEI.mixed 'INCLUDE'>
<!ENTITY % TEI.drama 'INCLUDE'>
<!ENTITY % TEI.corpus 'INCLUDE'>
<!ENTITY % TEI.prose 'INCLUDE'>
<!ENTITY % TEI.figures 'INCLUDE'>
<!ENTITY % TEI.linking 'INCLUDE'>
<!ENTITY % TEI.transcr 'INCLUDE'>
<!ENTITY % TEI.names.dates 'INCLUDE'>
<!ENTITY % TEI.spoken 'INCLUDE'>
<!ENTITY % TEI.header 'INCLUDE'>
<!ENTITY % ISOlat1 SYSTEM
'http://www.tei-c.org/Entity_Sets/Unicode/iso-lat1.ent'> %ISOlat1;
<!ENTITY % ISOlat2 SYSTEM
'http://www.tei-c.org/Entity_Sets/Unicode/iso-lat2.ent'> %ISOlat2;
<!ENTITY % ISOnum SYSTEM
'http://www.tei-c.org/Entity_Sets/Unicode/iso-num.ent'> %ISOnum;
<!ENTITY % ISOpub SYSTEM
'http://www.tei-c.org/Entity_Sets/Unicode/iso-pub.ent'> %ISOpub;
]>
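
One way the catalog lookup for a DOCTYPE like the one above might be handled
outside oXygen is a SAX EntityResolver that maps the public identifier to a
local copy of the DTD. The following is only a minimal sketch: the class name
LocalDtdResolver, its map() method, and the local file location are all
hypothetical, and a real setup would more likely read an OASIS catalog file
(for example via the Apache xml-commons resolver) than hard-code mappings.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for an XML catalog: public ID -> local URI. */
class LocalDtdResolver implements EntityResolver {
    private final Map<String, String> publicIdToUri = new HashMap<>();

    /** Register a local location for a public identifier (illustrative helper). */
    void map(String publicId, String localUri) {
        publicIdToUri.put(publicId, localUri);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = (publicId == null) ? null : publicIdToUri.get(publicId);
        if (local == null) {
            return null; // no mapping: let the parser resolve systemId itself
        }
        InputSource source = new InputSource(local);
        source.setPublicId(publicId);
        return source;
    }
}
```

A resolver like this would be registered on the parser's XMLReader with
setEntityResolver(...), after something like
resolver.map("-//TEI P4//DTD Main Document Type//EN", "file:///path/to/tei2.dtd"),
where the file location is a placeholder.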

I need to use Xerces, because I find that the default parser in oXygen
(which is Xerces) can successfully parse these files (and LIBXML does not
work for files using the full TEI due to problems with the DTD).

My best understanding (which may be completely off) is that to use Xerces as
an XML parser from the command line, I essentially run an XML file through
an XSL stylesheet, on the assumption that the source file has to validate
for the transformation to run successfully.

I have modified a previous stylesheet that processes all TEI elements found
in these documents, and I use this syntax:

java com.icl.saxon.StyleSheet -x org.apache.xerces.parsers.SAXParser \
    source_file.xml stylesheet.xsl > /dev/null

I am using Xerces as it comes with oXygen (and have not downloaded it
separately).  Since I am only really interested in parsing and not the
output, I redirect it to /dev/null.  I have the following in my bash profile
for the CLASSPATH:

CLASSPATH=$CLASSPATH:/Applications/oxygen/lib/saxon.jar:\
/Applications/oxygen/frameworks/docbook/xsl/extensions/saxon653.jar.ext:\
/Applications/oxygen/lib/xercesImpl.jar
export CLASSPATH

The above command WORKS, and will pick up SOME errors, but is clearly
missing others.  Does anyone have any more straightforward syntax for just
PARSING with Xerces, or have any ideas why some errors (I have tested) are
not being reported through this process?  (One possibility is that it's just
checking well-formedness, not validity, which I need to test further.)
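
On the well-formedness possibility raised above: SAX parsers, Xerces included,
do not validate against the DTD unless validation is switched on explicitly,
which would be consistent with only some errors being reported. Below is a
minimal sketch of a standalone validating parse through the standard JAXP API.
The class name ValidateXml and its validate() helper are illustrative, not part
of Xerces or oXygen, and the sketch assumes that a JAXP-capable parser (such as
the xercesImpl.jar shipped with oXygen) is on the CLASSPATH.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative helper: DTD-validates each file named on the command line. */
class ValidateXml {

    /** Records errors; DefaultHandler would otherwise ignore validity errors. */
    static class Collector extends DefaultHandler {
        final List<String> problems = new ArrayList<>();

        private void record(String level, SAXParseException e) {
            problems.add(level + " " + e.getSystemId() + ":" + e.getLineNumber()
                    + ": " + e.getMessage());
        }

        @Override public void error(SAXParseException e) {
            record("error", e); // validity errors are reported here
        }

        @Override public void fatalError(SAXParseException e) throws SAXParseException {
            record("fatal", e); // well-formedness errors end the parse
            throw e;
        }
    }

    static List<String> validate(InputSource source) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // DTD validation is off by default
        SAXParser parser = factory.newSAXParser();
        Collector collector = new Collector();
        try {
            parser.parse(source, collector);
        } catch (SAXParseException e) {
            // already recorded by fatalError()
        }
        return collector.problems;
    }

    public static void main(String[] args) throws Exception {
        boolean allValid = true;
        for (String name : args) {
            List<String> problems =
                    validate(new InputSource(new File(name).toURI().toString()));
            for (String p : problems) System.err.println(p);
            System.out.println(name + (problems.isEmpty() ? ": valid" : ": INVALID"));
            allValid &= problems.isEmpty();
        }
        System.exit(allValid ? 0 : 1);
    }
}
```

After compiling, something like "java ValidateXml *.xml" would check a whole
directory. To be sure the Xerces implementation is the one selected, the JAXP
factory can be named explicitly with
-Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl.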

Thanks in advance for any help/suggestions.

Andrew

Andrew Rouner
Digital Library Services
Washington University Libraries
St. Louis, MO




> From: Oxygen XML Editor support <>
> Date: Tue, 25 Jul 2006 12:47:23 +0300
> To: Andrew Rouner <>
> Subject: Re: Differences in validators/ dtd problems?
> 
> Dear Andrew Rouner,
> 
> Thank you for contacting us.
> The default parser used by oXygen is Xerces 2.8.0 (that is the latest
> Xerces version). At first glance this looks like a problem/bug in XMLLINT.
> If you want to invoke Xerces to parse a document from the command line then
> you can do that through one of its sample applications:
> http://xerces.apache.org/xerces2-j/samples.html
> 
> Best Regards,
> George
> ---------------------------------------------------------------------
> George Cristian Bina
> <oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
> http://www.oxygenxml.com
> 
> 
>



