
Re: [xsl] Aligning/merging two sequences


Subject: Re: [xsl] Aligning/merging two sequences
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Thu, 30 Sep 2010 18:08:32 +0100

I don't think it's straightforward at all - people have spent years perfecting algorithms for finding diffs between two sequences. I'm no expert in this area, but if I had the problem I would start by searching for appropriate algorithms before even thinking about writing an XSLT implementation. Presumably there's a trade-off between the time spent and the perfection of the result.
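
One well-known family of such algorithms is built on the longest common subsequence (LCS), the idea behind classic diff tools. What follows is only a minimal Python sketch of that idea applied to two word lists. It is not an XSLT solution and not something taken from this thread. The function name `lcs_align`, the `(n, word)` tuple layout, and the "skipped" marker are assumptions made for illustration. It uses quadratic time and memory, so it shows the trade-off rather than resolving it.

```python
def lcs_align(timed, full):
    """Align timed words onto the full word list via an LCS table.

    timed: list of (n, word) pairs (hypothetical stand-in for input1//w)
    full:  list of words          (hypothetical stand-in for input2//w)
    Returns a list of (n or "skipped", word), one entry per word in `full`.
    """
    a = [word for _, word in timed]
    m, n = len(a), len(full)

    # table[i][j] = length of the LCS of a[i:] and full[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == full[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover one alignment.
    out = []
    i = j = 0
    while j < n:
        if i < m and a[i] == full[j]:
            out.append((timed[i][0], full[j]))  # matched: carry n across
            i += 1
            j += 1
        elif i < m and table[i + 1][j] >= table[i][j + 1]:
            i += 1  # timed word with no counterpart in `full`: drop it
        else:
            out.append(("skipped", full[j]))  # word only present in `full`
            j += 1
    return out
```

Unlike a plain left-to-right scan, this tolerates words in the timestamped list that have no counterpart in the full list, which matters if that list is imperfect.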

Michael Kay
Saxonica

On 30/09/2010 5:51 PM, Markus Flatscher wrote:
I'm banging my head against a sequence alignment problem. I have a feeling that this is straightforward, but I can't put my finger on what's missing from my attempts.

Suppose I have two inputs like so, where input1//w is always a subset of input2//w:

<input1>
<w n="1">I</w>
<w n="2">am</w>
<w n="3">a</w>
<w n="4">sequence</w>
</input1>

<input2>
<w>I</w>
<w>am</w>
<w>a</w>
<w>longer</w>
<w>longer</w>
<w>sequence</w>
</input2>

I'd like to get output like so:

<output>
<w n="1">I</w>
<w n="2">am</w>
<w n="3">a</w>
<w n="skipped">longer</w>
<w n="skipped">longer</w>
<w n="4">sequence</w>
</output>

I.e., for each input1//w, @n should be copied to the nearest following sibling <w> in input2 whose string value matches that of the input1 <w> (i.e. matches "."); <w>s in input2 that aren't in input1 should be flagged as "skipped".
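
If input1 really is an ordered subsequence of input2 (an assumption; "subset" above is read that way here), the rule just described amounts to a single greedy pass with one cursor into input1. Below is a minimal Python sketch of that logic only, not an XSLT solution. The function name `align_greedy` and the `(n, word)` tuples are hypothetical stand-ins for the XML.

```python
def align_greedy(timed, full):
    """Greedy one-pass alignment, assuming `timed` is a subsequence of `full`.

    timed: list of (n, word) pairs (hypothetical stand-in for input1//w)
    full:  list of words          (hypothetical stand-in for input2//w)
    Returns a list of (n or "skipped", word), one entry per word in `full`.
    """
    out = []
    i = 0  # cursor: next timed word still waiting for a match
    for word in full:
        if i < len(timed) and timed[i][1] == word:
            out.append((timed[i][0], word))  # nearest following match
            i += 1
        else:
            out.append(("skipped", word))
    if i != len(timed):
        # The subsequence assumption does not hold for this input.
        raise ValueError("timed words are not a subsequence of the full words")
    return out


timed = [("1", "I"), ("2", "am"), ("3", "a"), ("4", "sequence")]
full = ["I", "am", "a", "longer", "longer", "sequence"]
for n, word in align_greedy(timed, full):
    print(f'<w n="{n}">{word}</w>')
```

When the timestamped list contains a word that never turns up in the full list, this sketch raises an error instead of guessing, which is where a proper diff-style algorithm would be needed.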

P.S.: The use case is aligning an imperfect but timestamped transcription of an audio file (input1, machine-generated) with a perfect but not-timestamped one (input2, human-generated).

Thanks much for any help,

Markus

