Re: [xsl] Grouping of text input file lines


Subject: Re: [xsl] Grouping of text input file lines
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Sun, 11 Aug 2013 18:44:52 +0100

I've generally done this using your second approach: convert each line to an
element and then use group-starting-with to group them.
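
A minimal XSLT 2.0 sketch of that approach, not a finished solution. The parameter name input-uri, the file name schema.txt, the template name main and the output names database, relation, field and name are all assumptions made for illustration. It also assumes every non-blank line has the form "tag: value" and that the "#" remarks in the example are annotations, not file content.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical parameter: location of the plain text input -->
  <xsl:param name="input-uri" as="xs:string" select="'schema.txt'"/>
  <xsl:output indent="yes"/>

  <xsl:template name="main">
    <!-- Step 1: turn each "tag: value" line into <tag>value</tag> -->
    <xsl:variable name="lines">
      <xsl:for-each select="tokenize(unparsed-text($input-uri), '\r?\n')[normalize-space()]">
        <xsl:element name="{normalize-space(substring-before(., ':'))}">
          <xsl:value-of select="normalize-space(substring-after(., ':'))"/>
        </xsl:element>
      </xsl:for-each>
    </xsl:variable>

    <!-- Step 2: nested grouping, first on rel, then on ele -->
    <database>
      <xsl:for-each-group select="$lines/*" group-starting-with="rel">
        <xsl:choose>
          <xsl:when test="self::rel">
            <relation name="{.}">
              <xsl:for-each-group select="current-group()[position() gt 1]"
                                  group-starting-with="ele">
                <xsl:choose>
                  <xsl:when test="self::ele">
                    <field name="{.}">
                      <xsl:copy-of select="current-group()[position() gt 1]"/>
                    </field>
                  </xsl:when>
                  <!-- Relation options that precede the first ele -->
                  <xsl:otherwise>
                    <xsl:copy-of select="current-group()"/>
                  </xsl:otherwise>
                </xsl:choose>
              </xsl:for-each-group>
            </relation>
          </xsl:when>
          <!-- DB options that precede the first rel -->
          <xsl:otherwise>
            <xsl:copy-of select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </database>
  </xsl:template>
</xsl:stylesheet>
```

With Saxon this could be started from the named template (for example with -it:main). The sketch leaves two things open. A relation option that follows the last ele, such as num in the example, ends up inside the last field and needs separate handling. The joining of com and acc entries is not attempted.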

In XSLT 3.0 we're allowing patterns to match atomic values, so you can use
group-starting-with directly on a sequence of strings, without first
converting each line to an element.
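
A sketch of how that might look, using the predicate-pattern syntax of XSLT 3.0 as it was eventually standardized; details in the draft current at the time of writing may differ. As before, input-uri, schema.txt, the mode name option and the output names database, relation, field and name are assumptions for illustration, and every non-blank line is assumed to have the form "tag: value".

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical parameter: location of the plain text input -->
  <xsl:param name="input-uri" as="xs:string" select="'schema.txt'"/>
  <xsl:output indent="yes"/>

  <xsl:template name="xsl:initial-template">
    <database>
      <!-- The population is a sequence of strings; the pattern matches strings -->
      <xsl:for-each-group select="unparsed-text-lines($input-uri)[normalize-space()]"
                          group-starting-with=".[matches(., '^\s*rel:')]">
        <xsl:choose>
          <xsl:when test="matches(., '^\s*rel:')">
            <relation name="{normalize-space(substring-after(., ':'))}">
              <xsl:for-each-group select="tail(current-group())"
                                  group-starting-with=".[matches(., '^\s*ele:')]">
                <xsl:choose>
                  <xsl:when test="matches(., '^\s*ele:')">
                    <field name="{normalize-space(substring-after(., ':'))}">
                      <xsl:apply-templates select="tail(current-group())" mode="option"/>
                    </field>
                  </xsl:when>
                  <!-- Relation options that precede the first ele -->
                  <xsl:otherwise>
                    <xsl:apply-templates select="current-group()" mode="option"/>
                  </xsl:otherwise>
                </xsl:choose>
              </xsl:for-each-group>
            </relation>
          </xsl:when>
          <!-- DB options that precede the first rel -->
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()" mode="option"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </database>
  </xsl:template>

  <!-- Turn one "tag: value" string into <tag>value</tag> -->
  <xsl:template match=".[. instance of xs:string]" mode="option">
    <xsl:element name="{normalize-space(substring-before(., ':'))}">
      <xsl:value-of select="normalize-space(substring-after(., ':'))"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
```

The same caveat applies as with the element-based version: an option line that follows the last ele of a relation is grouped with that field unless it is picked out separately.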

Michael Kay
Saxonica

On 11 Aug 2013, at 15:46, Wolfgang Laun wrote:

> I'll briefly describe the problem and outline two approaches to a
> solution. I'd be pleased to receive a comment or two.
>
> The task is to convert a plain text file to XML using XSLT 2.0. The
> text file consists of lines, all of the form
>  tag: value
> and these lines are grouped at three levels: "database", "relation"
> and "field", where each entity has some options and one or more
> children of the lower level (except for field, of course).
>
> Example, indentation according to nesting level:
>
> node: abc    # a DB option
> key: CMOS   # a DB option
> rel: rlo_one
>  com: a relation # a relation option
>  alg: direct         # a relation option
>  ele: fa int
>    com: blurb       # element (field) options
>    def: 0
>    acc: px
>    acc: py
>  ele: fb chars
>    com: bla bla
>    def: "----"
>    alg: permute
>  num: 100          # a relation option
> rel: rlo_two
>  com: another relation    # a relation option
>  com: more comment
>  com: yet more comment
>  ele: fx int
>    com: blurb
>    def: 0
>    acc: px
>  ele: fy int
>    com: bla bla
>    def: 42
>  num: 50                   # a relation option
>
> The expected XML structure is obvious, I think: a sequence of DB
> options and relation elements; these contain relation options and
> field elements, which contain field options. Field order must not be
> changed. "com" entries should be joined while observing line breaks,
> and "acc" entries too, but joined with a space.
>
> The first basic idea I used throughout is to maintain another string
> sequence in parallel to the one containing the text lines. That
> sequence contains just the tags, so that index-of can be used to
> compute "interesting" line numbers. This way, subsequences of lines
> for all or individual relations and fields can be conveniently
> extracted.
>
> The second idea is to use grouping. The sequence of lines is converted
> to a sequence of nodes <tag>value</tag> and a nested
> group-starting-with separates relations and fields - almost. As you
> can see, there are some leading lines defining DB options, and each
> relation contains option lines before and after the element groups.
> Most likely, cherry-picking lines and line groups prior to the
> glorious for-each-group has to be done using the technique described
> above.
>
> Any better ideas?
> Thanks

