Using collection() with archives

Posted: Wed Sep 01, 2010 6:15 am
by Eiríkr
I'm working on extracting content from MS Excel 2007 format. Ideally, I should be able to work with the archive directly; I can indeed access content without manually unzipping the Excel file first, so that's a good beginning.

However, I've run into some challenges when trying to dynamically load *.xml files located within the *.xlsm archive.

I don't know ahead of time how many worksheets the Excel workbook might contain, so I need to be able to look through multiple sheets, named internally sheet1.xml, sheet2.xml, ... sheetN.xml. Borrowing the wisdom of others found via Google, it sounds like the collection() function is a good way to leverage xsl:for-each constructs to work with multiple files.

However, attempting to use collection() produces an ArchiveEntryNotFoundException, though I can find nothing wrong with the path.

The problematic XSL:

Code:

<xsl:for-each select="for $f in
    collection(
      concat($BASE_PATH, '!/xl/worksheets/?select=sheet*.xml;recurse=yes;on-error=warning')
    )
  return $f">
  ... do something with each file ...
</xsl:for-each>

The error I get:

Code:

SystemID: I:\My Documents\OxygenXMLEditor\Projects\[proj_base]\XL-MT\XL2MT.xsl
Severity: error
Description: de.schlichtherle.io.ArchiveController$ArchiveEntryNotFoundException: I:\My Documents\OxygenXMLEditor\Projects\[proj_base]\XL-MT\Source\Working.xlsm\xl\worksheets (no such file entry) - I:\My Documents\OxygenXMLEditor\Projects\[proj_base]\XL-MT\Source\Working.xlsm\xl\worksheets (no such file entry)
Start location: 62:0

Is the collection() function simply not capable of handling archives? Have I goofed here somehow? Removing the recurse=yes;on-error=warning portion does not change the outcome.

Any advice appreciated.

Cheers,

-- Eiríkr

Re: Using collection() with archives

Posted: Wed Sep 01, 2010 10:51 am
by adrian
Hello,

The archive support in Oxygen (through URIs) allows access only to files (archive entries). The URI you are composing refers to a folder inside the archive ("worksheets") and runs a select on it. Oxygen assumes that "worksheets" is a file (archive entry), so the lookup fails. Folder access was never implemented because the Oxygen GUI doesn't use it and we thought it would be inaccessible by other means.

I'm afraid this means you cannot use the sheet discovery method you've used here; you have to know how many sheets there are and use their file names accordingly.
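
One way to obtain those file names outside the stylesheet is to list the archive's entries before the transformation runs. The following is only a minimal sketch: it assumes the JDK's plain java.util.zip classes (not Oxygen's archive support) and the sheetN.xml naming described in the original post. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists worksheet entries inside an .xlsx/.xlsm archive.
final class SheetLister {
    // Assumed naming, taken from the original post: xl/worksheets/sheetN.xml
    private static final Pattern SHEET =
            Pattern.compile("xl/worksheets/sheet\\d+\\.xml");

    /** Returns the matching entry names, ordered by sheet number. */
    static List<String> listSheetEntries(Path workbook) throws IOException {
        try (ZipFile zip = new ZipFile(workbook.toFile())) {
            return zip.stream()
                    .map(ZipEntry::getName)
                    .filter(name -> SHEET.matcher(name).matches())
                    .sorted(Comparator.comparingInt(SheetLister::sheetNumber))
                    .collect(Collectors.toList());
        }
    }

    // Strips every non-digit so "xl/worksheets/sheet10.xml" becomes 10.
    private static int sheetNumber(String entryName) {
        return Integer.parseInt(entryName.replaceAll("\\D", ""));
    }
}
```

Each returned name could then be appended to the archive URI after "!/" and loaded one entry at a time, which is the kind of access the archive support does provide.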

Regards,
Adrian

Re: Using collection() with archives

Posted: Wed Sep 01, 2010 7:55 pm
by Eiríkr
Thank you Adrian, that's very informative.

A few questions come to mind:
  • Is this archive access feature specific to Oxygen, or is it part of (or at least usable by) the underlying Saxon XSL engine?
  • Will this kind of archive accessibility be scheduled for implementation in future? If not, how do I request this?
  • Is there any Oxygen XSL feature to programmatically unzip an archive to a temp directory or other location, thereby allowing normal file and directory access? I suspect not, but no harm in asking. :)

I'm still learning how Excel 2007 stores things; I'm pretty sure I saw a couple of other possible ways of learning sheet numbers and names, so at least this missing archive functionality isn't a showstopper.

Cheers,

-- Eiríkr

Re: Using collection() with archives

Posted: Thu Sep 02, 2010 10:52 am
by adrian
Hi,

Quote:
Is this archive access feature specific to Oxygen, or is it part of (or at least usable by) the underlying Saxon XSL engine?

The archive access feature that you are using (the zip:file protocol) is specific to Oxygen. It allows read/write access to a file (archive entry) inside an archive.

You can also use Java's built-in jar access feature, which also works with ZIP files and in any Java application, but it is read-only. The URIs are almost the same; you just have to replace the 'zip:file' protocol with 'jar:file'.
e.g.
jar:file:/path/to/zip/my.zip!/path/inside/zip/my.resource
This kind of URI works even in Firefox.
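
As a rough illustration of the jar:file form from Java itself, the sketch below builds such a URI for one archive entry and reads it through the JDK's standard URL handling. The class and method names are hypothetical, and it assumes that the entry path needs no percent-encoding and that the entry is UTF-8 text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper: read-only access to one archive entry via a jar: URL.
final class JarUrlReader {

    /** Builds jar:file:/path/to/archive!/entry/path for the given entry. */
    static URL entryUrl(Path archive, String entryPath) throws IOException {
        // Path.toUri() produces an absolute file: URI and encodes spaces.
        return URI.create("jar:" + archive.toUri() + "!/" + entryPath).toURL();
    }

    /** Reads the whole entry as text (assumes UTF-8 content). */
    static String readEntry(Path archive, String entryPath) throws IOException {
        URLConnection conn = entryUrl(archive, entryPath).openConnection();
        // Skip the JVM-wide jar cache so the archive is released on close.
        conn.setUseCaches(false);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

For example, JarUrlReader.readEntry(Paths.get("Working.xlsm"), "xl/workbook.xml") would return the text of that one entry. As with the URI above, this addresses a single entry, not a folder.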

Quote:
Will this kind of archive accessibility be scheduled for implementation in future? If not, how do I request this?

I'm afraid this support wouldn't be useful with Saxon even if we were to implement it. In my tests, 'select' only works with the file protocol, which is probably something Saxon does internally. I couldn't make it work with ftp or http, so it wouldn't work with zip:file or jar:file either.

Quote:
Is there any Oxygen XSL feature to programmatically unzip an archive to a temp directory or other location, thereby allowing normal file and directory access? I suspect not, but no harm in asking. :)

Sorry, there's no XSL feature for this. You could, however, do it by writing a Java extension that uses the TrueZIP API ([oxygen-install-folder]/lib/truezip-6.jar). You can also find the library here and use it separately from Oxygen: https://truezip.dev.java.net/
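
A minimal sketch of such an unpacking step follows. Note that it swaps in the JDK's own java.util.zip classes for the TrueZIP API, so it only illustrates the idea and is not the extension described above. The class and method names are hypothetical, and wiring it into the XSLT processor as an extension function is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: unpacks a ZIP-based archive into a fresh temp directory.
final class ArchiveUnpacker {

    /** Extracts every entry and returns the directory that now holds them. */
    static Path unpackToTempDir(Path archive) throws IOException {
        Path target = Files.createTempDirectory("unpacked-")
                           .toAbsolutePath().normalize();
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                Path out = target.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would land outside target.
                if (!out.startsWith(target)) {
                    throw new IOException("Entry escapes target: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    try (InputStream in = zip.getInputStream(entry)) {
                        Files.copy(in, out);
                    }
                }
            }
        }
        return target;
    }
}
```

Once the entries are ordinary files, a directory pattern such as sheet*.xml can be applied to the returned directory through the file protocol, which is the case where 'select' did work in the tests mentioned above.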

Regards,
Adrian

Re: Using collection() with archives

Posted: Thu Sep 02, 2010 10:30 pm
by Eiríkr
All good to know, thank you Adrian! I've since found other workarounds for this particular project, but I will keep your insights in mind for future.

Cheers,

-- Eiríkr