
Re: [xsl] Top 10 XSLT patterns

Subject: Re: [xsl] Top 10 XSLT patterns
From: Michael Sokolov <msokolov@xxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 03 Apr 2014 20:48:14 -0400

On 4/3/14 11:33 AM, Abel Braaksma (Exselt) wrote:
> It will likely be non-trivial to compile such a list without a good query
> to search through existing stylesheets and known programming challenges.
> But from your experience, what patterns do you encounter most often?

Here are some concrete examples that have come up repeatedly for us when processing large texts. I don't see how they map onto the patterns you are all discussing, but they are probably combinations of them in some way.

Something we've had to implement multiple times in various combinations (XSLT 1, XSLT 2, XQuery, JDOM/Java) is what I call the "proem" extractor: pull out the first N characters (or words) of a document while maintaining all of the ancestral markup. A more elaborate variant extracts an intermediate section, which could be defined in various ways (characters N through N+100, everything between two <mark> elements, etc.). I don't know what to call that -- tree surgery? Typically the goal is to generate document summaries, hit highlighting, or annotated passages.
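In XSLT 2.0, one way to sketch the proem extractor is to let every node count the characters that precede it in document order and drop or truncate itself accordingly. This is a minimal sketch, assuming a character (not word) limit; it re-counts preceding text at every node, so a production version would thread a running budget through template parameters instead:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:param name="limit" select="200"/>

  <!-- Copy an element only if some of its text starts before the limit. -->
  <xsl:template match="*">
    <xsl:variable name="before"
        select="string-length(string-join(preceding::text(), ''))"/>
    <xsl:if test="$before lt $limit">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Truncate the text node that crosses the limit; drop later ones. -->
  <xsl:template match="text()">
    <xsl:variable name="before"
        select="string-length(string-join(preceding::text(), ''))"/>
    <xsl:if test="$before lt $limit">
      <xsl:value-of select="substring(., 1, $limit - $before)"/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Because each element only copies itself when some of its content survives, the ancestral markup of the kept text is preserved automatically.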

Another major problem for us has been reference resolution: a set of documents is marked up with cross-references to other documents, or sub-documents, and the task is to copy some part of the referenced document into the reference (as a performance optimization, so it doesn't have to be looked up later). The basic idea is simple enough, but it is complicated by scale: many large documents, each with many references. Another complication is that the document corpus may be constantly evolving; as new documents are introduced, both outbound *and inbound* references must be resolved.
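For the single-document (or merged-corpus) case, the core of this is a key lookup inside an identity transform. A minimal sketch, with invented <xref ref="..."/> elements pointing at @id values and inlining the target's <title>; resolving across separate documents would use doc() or collection() in the lookup instead:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:key name="by-id" match="*[@id]" use="@id"/>

  <!-- Identity transform: pass everything else through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Inline part of the referenced target so it need not be
       looked up again at delivery time. -->
  <xsl:template match="xref">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:copy-of select="key('by-id', @ref)/title"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```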

There are lots of variants of this reference resolution problem: simple links, abbreviation expansion, footnote inlining. Footnotes are especially challenging since they may contain further references to additional footnotes, so the expansion is recursive (and, inevitably, circular). References might point to non-XML documents and trigger non-XML processing: for image files in particular, we would typically want to store a reference to the image file indicating whether it exists (and where, if we had to hunt for it), its size, format, etc.
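To keep the recursive footnote expansion from looping, one common trick is to thread the set of already-expanded ids through the recursion and stop when an id reappears. A hedged XSLT 2.0 fragment, assuming hypothetical <fnref ref="..."/> elements and <footnote id="..."> definitions:

```xml
<xsl:key name="fn" match="footnote" use="@id"/>

<xsl:template match="fnref">
  <xsl:param name="seen" select="()" tunnel="yes"/>
  <xsl:choose>
    <!-- Circular chain: emit the bare reference instead of recursing. -->
    <xsl:when test="@ref = $seen">
      <xsl:copy-of select="."/>
    </xsl:when>
    <!-- Otherwise inline the footnote body, remembering this id. -->
    <xsl:otherwise>
      <xsl:apply-templates select="key('fn', @ref)/node()">
        <xsl:with-param name="seen" select="($seen, string(@ref))"
            tunnel="yes"/>
      </xsl:apply-templates>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```

Tunnel parameters keep the $seen list flowing through any intermediate templates without every one of them having to declare it.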

A very common feature of all of our pipelines is chunking. The canonical example is pulling all the chapters out of a book document and creating standalone chapter documents plus a skeletal book document that serves as cover page and table of contents. We usually want to preserve some ancestral markup in the "chapters", and since we are generating new documents, we need to keep track of references to them: for the TOC, for next/previous navigation links, and for translating/resolving cross-references that were intra-document but have become inter-document.
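With XSLT 2.0's xsl:result-document the basic chunking shape is short; the hard part is the bookkeeping described above. A minimal sketch with invented element names (book, chapter, title):

```xml
<xsl:template match="book">
  <!-- The skeletal book document: cover page plus table of contents. -->
  <book>
    <toc>
      <xsl:for-each select="chapter">
        <entry href="chapter{position()}.xml">
          <xsl:value-of select="title"/>
        </entry>
      </xsl:for-each>
    </toc>
  </book>
  <!-- One standalone document per chapter. -->
  <xsl:for-each select="chapter">
    <xsl:result-document href="chapter{position()}.xml">
      <xsl:copy-of select="."/>
    </xsl:result-document>
  </xsl:for-each>
</xsl:template>
```

The same position-based hrefs can feed the next/previous links, and rewriting formerly intra-document cross-references means mapping each target id to the chunk file it landed in.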

Another dumb thing we do all the time is to run a list of XPaths over a document and save the results into a Java object for easy access in our application framework. This is just a simplified version of marshalling (or unmarshalling?) to cross the language barrier (we call it xml mapping). We also use XSLT to render these XML documents as HTML, but when we need (usually atomic) values to be handled by our Java application layer, we want an easy way to extract them from the XML. For large numbers of paths, I think we would be better off doing this with a single generated XSLT (so we don't have to traverse the document once per path), but currently we don't do that.
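The "single generated XSLT" idea might look something like the following: each configured XPath becomes one <field> element, so the Java side runs one transform and reads back a flat name/value list. The paths and field names here are invented:

```xml
<xsl:template match="/">
  <fields>
    <field name="title">
      <xsl:value-of select="/book/metadata/title"/>
    </field>
    <field name="isbn">
      <xsl:value-of select="/book/metadata/isbn"/>
    </field>
    <field name="chapter-count">
      <xsl:value-of select="count(/book/chapter)"/>
    </field>
  </fields>
</xsl:template>
```

Each xsl:value-of still evaluates its own path, but the document is parsed and loaded only once, and a decent processor can optimize the combined evaluation, instead of our current one-compiled-XPath-per-field loop on the Java side.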

I hope that's useful.

