Subject: [xsl] Over 300 MB XML file and XSLT or XQuery
From: alan m <highmarkz@xxxxxxxxx>
Date: Wed, 12 Jan 2005 17:30:31 -0800 (PST)

This reply somehow got lost a few days ago, so I am
emailing it again.

I forgot to mention that memory is limited. Even on a
1 GB system, a 300 MB file cannot be processed with a
DOM, since the in-memory tree can take as much as ten
times the file size. It would crash the system.

The big XML file contains no references to the small
XML files' names. The small files are generated from
the big file using STX and XSLT transforms, so they
contain its content, but their order and structure may
differ from the big file's. I used STX in the first
place to break the big XML file down into much smaller
files, and then applied further XSLT processing to
those small files.
The STX documentation is at
http://stx.sourceforge.net/documents/ and I used the
Joost implementation.

One good reference was given by Raffaele Sena:
http://dsd.lbl.gov/nux/
which seems promising for dealing with the big XML
file.

To Michael Kay: performance is not an issue. I am very
new to XQuery. I would like to get my hands dirty with
XQuery to learn a new trick of the trade, but I would
also like to follow a technically correct approach to
this kind of problem.
Let's assume I have solved the big-XML-file problem.
Now, given a text node, I need to search for its text
in the tens of thousands of small XML or HTML files,
generate statistics (where it was found, how many
times, etc.), and, if it is not found, generate
meaningful logs. I can write Java classes if
necessary.

I want to avoid converting the small files into one
large file. I was thinking of treating the collection
of all the small files as an XML database and using
XQuery.
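To make this concrete, here is roughly what I have in
mind, sketched in XQuery. The directory URI and file
names are hypothetical, the ?select= syntax is
Saxon-specific, and $needle would be a text value
taken from the big file:

```xquery
(: Sketch: search one text value across all small files :)
declare variable $needle external;

for $f in collection('file:///C:/small?select=*.xml')
(: collect the text nodes in this file that contain the value :)
let $hits := $f//text()[contains(., $needle)]
where exists($hits)
return
  (: report which file matched and how many times :)
  <found file="{document-uri($f)}" count="{count($hits)}"/>
```

A value that produces no <found> elements at all could
then be written to the "not found" log.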

From Michael Kay:
>Another solution, again dependent on XSLT, is to use
>grouping. This doesn't require the small documents to
>be aggregated into a single document. If you take the
>union of the text nodes in the large document and the
>values in the small documents, and then do grouping,
>a group of size 1 indicates a value that is present
>in one file and not the other.

I would like more clarification about the above
approach. Also, is this XQuery or XSLT?
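My tentative understanding, sketched in XSLT 2.0
(which has grouping via xsl:for-each-group): the file
names, the small/ directory, the Saxon-specific
?select= collection syntax, and the choice of
normalize-space() as the grouping key are all my
assumptions, not part of the suggestion.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">
  <xsl:template name="main">
    <report>
      <!-- Union of text nodes from the big file and all small files -->
      <xsl:for-each-group
          select="doc('big.xml')//text() |
                  collection('small?select=*.xml')//text()"
          group-by="normalize-space(.)">
        <!-- Per the quoted suggestion: a group of size 1 means the
             value is present on one side and not the other -->
        <xsl:if test="count(current-group()) = 1">
          <unmatched value="{current-grouping-key()}"/>
        </xsl:if>
      </xsl:for-each-group>
    </report>
  </xsl:template>
</xsl:stylesheet>
```

Is this the kind of grouping meant here, or have I
misread it?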

This is in reference to the original post:
""""""""""""""""""""""""""""""""""""""""""
I have an extremely large (over 300 MB) XML file and
tens of thousands of small XML files generated by
applying various XSLT transforms to the one big XML
file.

I am using Saxon for XSLT and will be using it also
for XQuery.

Is XQuery or XSLT the better solution for this
problem? Query each text node in the big XML file and
verify that its content is present in one of the
result XML files. Based on this information, generate
a report that shows which content is present and in
which file, and, in a separate section, which content
was not found in the result XML files, also showing
that content's parent element or other markup to
indicate its position in the big XML file.

All the small XML files are stored as flat files in
various directories on a Windows file system, although
most are in one directory. The big XML file is fairly
complex, with multiple levels of nested elements.

Any comments or suggestions?
Thank you
"""""""""""""""""""""""""""""""""""""

-Alan

