
[xsl] Applying Streaming To DITA Processing: Looking for Guidance

From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 9 Oct 2014 14:16:08 -0000

In the context of DITA processing, where a "map" document links to
potentially many thousands of topic documents in order to define a
complete publication, I have an existing XSLT 2 process that processes the
entire data set: it walks the map and processes each referenced document
in turn, constructing a single XML structure that captures all the
information needed to do numbering (or any similar publication-wide
process, such as index generation). This "data collection" process has the
necessary effect of parsing every document ultimately referenced from the
map, which can have a severe memory cost for large publications.

This structure is then provided as a tunnel parameter to the next phase of
processing, where the final deliverable result is generated (e.g., HTML
pages for a Web site, EPUB, etc.).
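To make the hand-off concrete, here is a minimal sketch of the two-phase
arrangement I mean (the variable, mode, and parameter names are
hypothetical, not my actual code):

```xml
<!-- Phase 1: walk the map and build the collected-data structure -->
<xsl:variable name="collected-data" as="element()*">
  <xsl:apply-templates select="$map-doc" mode="collect-data"/>
</xsl:variable>

<!-- Phase 2: generate the deliverable, passing the structure down
     to every template via a tunnel parameter -->
<xsl:apply-templates select="$map-doc" mode="generate-result">
  <xsl:with-param name="collected-data" select="$collected-data"
                  tunnel="yes"/>
</xsl:apply-templates>

<!-- Any template that needs the data declares the tunnel parameter -->
<xsl:template match="xref" mode="generate-result">
  <xsl:param name="collected-data" tunnel="yes"/>
  <!-- ... render using $collected-data ... -->
</xsl:template>
```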

I know that with Saxon I could save memory today by discarding documents
after the first phase but then I'd have to reparse them and that can incur
a steep cost as well. But that would be a Saxon-specific optimization and
I'm trying to avoid being tied to a specific XSLT implementation.
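For reference, the Saxon-specific route I'm alluding to is
saxon:discard-document(), which removes a document from Saxon's document
pool so it can be garbage-collected once nothing else references it. A
sketch (requires a Saxon edition that supports the extension, with
xmlns:saxon="http://saxon.sf.net/" declared; element names are
hypothetical):

```xml
<xsl:template match="topicref[@href]" mode="collect-data">
  <!-- Parse the topic, extract what the collection phase needs, then
       let Saxon drop the document from its pool so memory can be
       reclaimed. Re-requesting the same URI later forces a reparse. -->
  <xsl:apply-templates
      select="saxon:discard-document(document(@href))/*"
      mode="collect-data"/>
</xsl:template>
```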

My questions with regard to streaming in this use case:

Can streaming help, either with overall processing efficiency or with
memory usage?

Where would I go today or in the near future to gain the understanding of
streaming required to answer these questions (other than the XSLT 3 spec
itself, obviously)?

Because my data collection process is copying data to a new result, I'm
pretty sure it's inherently streamable: I'm just processing documents in
an order determined by a normal depth-first tree walk of the map structure
(a hierarchy of hyperlinks to topics) and grabbing relevant data (e.g.,
division titles, figure titles, index entries, etc.). If this were all I
was doing, then streaming would certainly help memory usage.
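Something like the following XSLT 3.0 sketch is what makes me think the
collection phase is streamable: a non-streaming walk of the map opens each
topic with xsl:source-document and processes it in a streamable mode
(mode and element names are hypothetical):

```xml
<!-- Streamable mode for pulling data out of each topic -->
<xsl:mode name="collect-data" streamable="yes"
          on-no-match="shallow-skip"/>

<!-- Map walk (not itself streamed): open each topic as a stream -->
<xsl:template match="topicref[@href]" mode="walk-map">
  <xsl:source-document href="{@href}" streamable="yes">
    <xsl:apply-templates mode="collect-data"/>
  </xsl:source-document>
  <xsl:apply-templates mode="walk-map"/>
</xsl:template>

<!-- Grab titles and the like; xsl:value-of select="." is a single
     consuming operation, which streamability analysis allows -->
<xsl:template match="title" mode="collect-data">
  <title-entry><xsl:value-of select="."/></title-entry>
</xsl:template>
```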

But because I must then process each topic again to generate the final
result, and that process is not directly streamable, would streaming the
first phase help overall?

Taken a step further: are there implementation techniques I could apply in
order to make the second phase streamable (e.g., collecting the
information needed to render cross references without having to fetch the
target elements) and could I expect that to then provide enough
performance improvement to justify the implementation cost? The current
code is both mature and relatively naive in its implementation. Reworking
it to be streamable could entail significant refactoring (maybe; that's
part of what I'm trying to determine).
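As an illustration of the technique I have in mind (all names
hypothetical): phase 1 records, for every potential cross-reference
target, whatever is needed to render a reference to it; phase 2 then
resolves an xref against that table alone, never touching the target
document:

```xml
<!-- Phase 1 emits entries like:
     <target key="chapter-02/fig-01" number="Figure 2-1" title="..."/> -->
<xsl:key name="target-data" match="target" use="@key"/>

<!-- Phase 2: resolve the xref purely from the collected data, using
     the three-argument form of key() to search that document -->
<xsl:template match="xref">
  <xsl:param name="collected-data" as="document-node()" tunnel="yes"/>
  <a href="{@href}">
    <xsl:value-of
        select="key('target-data', @href, $collected-data)/@number"/>
  </a>
</xsl:template>
```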

The actual data processing cost is more or less fixed, so unless streaming
makes the XSLT operations faster, I wouldn't expect streaming by itself to
reduce processing time.

However, the primary concern in this use case is memory usage: currently,
the memory required is proportional to the number of topics in the
publication, whereas it could be limited to just the largest single topic
plus the collected data itself (which is obviously much smaller than the
topics it is drawn from, since it includes only the minimum data needed to
enable numbering and the like).



Eliot Kimber, Owner
Contrext, LLC
