
Re: [xsl] Reflecting on: csv data to xml

Subject: Re: [xsl] Reflecting on: csv data to xml
From: Adam Retter <adam.retter@xxxxxxxxxxxxxx>
Date: Mon, 1 Jul 2013 11:36:10 +0300

It may be of some interest to this thread that at The National Archives
we do a lot of CSV to XML processing using a minimally modified version
of Andrew Welch's XSLT. However, once we have the data in XML we need to
extract and process it further, so we need to be certain of the original
CSV format (which in turn lets us be certain of the resultant XML
format, amongst other concerns). To achieve this we have built an open
source CSV Validation tool.

The CSV Validation tool consists of a specification for a simple text
grammar that describes the format of a CSV file, together with rules
that are asserted against the CSV file. It also includes an
implementation for the JVM (in Scala; we also provide a Java API) which
takes such a grammar and a CSV file and performs the validation,
reporting either every validation failure or a pass. The tool is
available at github.com/digital-preservation. It should be considered
beta, i.e. we are using it internally but until now it has not been
publicised. Documentation is still missing, but the EBNF file in the
source repo describes the grammar, and running the tool without
arguments prints simple command line usage. I hope documentation will
follow shortly; in the meantime, issues etc. should be aimed at the Github
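
To make the idea of "rules asserted against the CSV file" concrete, here
is a minimal sketch in Python. It is not the tool's grammar or its API:
the RULES table, the validate function and the column names are all
hypothetical, and each rule is simply a regular expression that every
cell in that column must match. What it does mirror is the behaviour of
reporting every failure rather than stopping at the first.

```python
import csv
import io
import re

# Hypothetical rule set: column name -> regular expression each cell must match.
# This is an assumption for illustration, NOT the grammar in the tool's EBNF file.
RULES = {
    "id": r"\d+",
    "title": r".+",
    "date": r"\d{4}-\d{2}-\d{2}",
}


def validate(csv_text, rules):
    """Return a list of error strings; an empty list means the file passes."""
    errors = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    expected = list(rules)
    # The header row must name exactly the columns the rules describe.
    if header != expected:
        return [f"line 1: expected header {expected}, found {header}"]
    for row in reader:
        line = reader.line_num
        if len(row) != len(expected):
            errors.append(
                f"line {line}: expected {len(expected)} cells, found {len(row)}"
            )
            continue
        # Check every cell so that all failures are reported, not just the first.
        for name, cell in zip(expected, row):
            if re.fullmatch(rules[name], cell) is None:
                errors.append(
                    f"line {line}, column {name}: {cell!r} fails /{rules[name]}/"
                )
    return errors


if __name__ == "__main__":
    sample = "id,title,date\n1,First record,2013-06-30\nx,,2013-07-01\n"
    problems = validate(sample, RULES)
    print("PASS" if not problems else "\n".join(problems))
```

Knowing that a file passes checks of this kind is what lets the later
XSLT stages assume a fixed shape for the generated XML.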

On 30 June 2013 10:49, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
> The thread "csv data to xml" was triggered by a relatively simple
> problem: converting CSV data to XML. There were one or two voices
> advocating the use of Perl (or similar) "for this kind of problem" in
> preference to XSLT, and there were claims that it would be a simple
> matter to use XSLT's analyze-string... Now I'm not going to vote
> either way - I'd just like to post some observations I made while
> investigating this. If you are impatient, skip down to "conclusion".
> I decided to implement this in Perl and was hoping to be able to
> compare this with an equivalent implementation in XSLT, concentrating
> on ease of development and maintainability. Ken's implementation
> <http://www.CraneSoftwrights.com/resources/#csv> filled the XSLT slot.
> I had a quick Perl 5 filter solution up and running in 30 minutes, no
> program parameters, hard-coded names for document and row elements,
> but using the first CSV line for obtaining the names for the cells.
> 10 minutes of that time were spent on getting a couple of Perl
> packages from CPAN, one for parsing CSV and another one for writing XML,
> which reduced the code I actually had to write to 23 lines.
> Considering this to be too sloppy, I spent some more time, adding
> a *nix-style CLI (for file names, element names,...), data checking
> (invalid element names, excess cells in a row), default element names
> for cells (using "A", "B",...), CLI documentation etc.
> Ken's solution falls short on a few points I was able to add easily. I can't
> say how difficult they would be to add to Ken's existing solution - it might
> not be a matter of minutes for some of those add-ons.
> Conclusions
> Perl's CPAN is a great asset. Certainly, the quality of its offerings varies,
> but the packages are tested and users report on their experience. (Why
> doesn't XSLT have anything like it?)
> Ken used a proprietary (?) solution for embedding documentation that can
> be extracted into HTML. Now that's great, but it is a solitary answer to the
> problem. Perl's pod is a somewhat clunky solution but it is supported with
> a rich toolset, along with the Perl distribution. I consider the
> existence of a documentation format that is defined along with the
> language as "state of the art" and essential for sustainable SW
> development.
> XSLT is "special purpose" for XML handling and consequently easy to use,
> but it isn't better than the average language for string processing.
> -W
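
For readers who want to see the shape of the filter described above
(cell names taken from the first CSV line, default names "A", "B", ...,
and checks for invalid element names and excess cells), here is a rough
sketch in Python. It is neither Wolfgang's Perl code nor Ken's XSLT; the
function names, the default element names "doc" and "row", and the
ASCII-only name check are all assumptions made for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only test for a usable XML element name (an assumption;
# the real XML Name production allows many more characters).
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def default_names(count):
    """Spreadsheet-style fallback names: A, B, ..., Z, AA, AB, ..."""
    names = []
    for i in range(count):
        name, n = "", i
        while True:
            name = chr(ord("A") + n % 26) + name
            n = n // 26 - 1
            if n < 0:
                break
        names.append(name)
    return names


def csv_to_xml(csv_text, doc="doc", row="row", header=True):
    """Build an element tree from CSV text; raise ValueError on bad input."""
    records = list(csv.reader(io.StringIO(csv_text)))
    if not records:
        return ET.Element(doc)
    if header:
        names, records, first = records[0], records[1:], 2
    else:
        names, first = default_names(max(len(r) for r in records)), 1
    for name in names:
        if NAME_RE.fullmatch(name) is None:
            raise ValueError(f"invalid element name: {name!r}")
    root = ET.Element(doc)
    for number, cells in enumerate(records, start=first):
        if len(cells) > len(names):
            raise ValueError(
                f"record {number}: {len(cells)} cells but only {len(names)} names"
            )
        row_el = ET.SubElement(root, row)
        for name, cell in zip(names, cells):
            ET.SubElement(row_el, name).text = cell
    return root


if __name__ == "__main__":
    tree = csv_to_xml("name,age\nAnn,42\n")
    print(ET.tostring(tree, encoding="unicode"))
```

The CSV parsing and XML writing are both delegated to library modules
here, which is the same division of labour the CPAN packages provide in
the Perl version.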

Adam Retter

skype: adam.retter
tweet: adamretter
