[xsl] analyze-string question


Subject: [xsl] analyze-string question
From: "Birnbaum, David J" <djbpitt@xxxxxxxx>
Date: Fri, 26 Oct 2012 00:38:29 -0400

Dear XSLT-list,

For an up-conversion of a plain-text word-list with grammatical classification
information to XML, I'm working with a file with lines like the following:

DRUG<OJ MO <MS-P <P 3V>>

The desired output is:

DRUG<stress>o</stress>J MO <alt>MS-P <alt>P 3V</alt></alt>

That there are angle brackets in the input isn't a problem; I can convert them
easily enough to &lt; and &gt; (or anything else, for that matter). The two
problems over which I'm stumbling are:

1. In the source document, angle brackets have two very different meanings:

a) Sometimes an angle bracket ("<" or ">") is a stand-alone (unpaired)
diacritic that tells me that the following vowel letter is stressed. That's
the case of the first "<" in the example above, and I want to remap it to
<stress> tags around the stressed vowel. Both "<" and ">" may have this
function; the former marks primary stress and the latter (which is rare)
secondary stress. I need to tag them differently.

b) At other times angle brackets delimit an alternative grammatical
classification, and in that case I want to remap open and close angle brackets
to open and close <alt> tags. In the example above, the primary grammatical
classification is the "MO" and the rest is an alternative. But ...

2. When angle brackets demarcate an alternative grammatical classification,
they may nest. In the example above, the primary grammatical classification is
"MO" with an alternative "MS-P <P 3V>". The alternative itself has a nested
structure, though; within the alternative, "MS-P" is primary and "P 3V" is
alternative.

For what it's worth, as far as I've been able to tell, there is never a stress
within an "alt" section (that is, between angle brackets that do not represent
stress, and that instead delimit an alternative grammatical classifier). It is
not the case that the stressed word always comes before the grammatical
information; there may also be stressed words later in the entry. Most entries
do not have alternative classifiers, but many do.

Until I stumbled on the nested alternative identifiers, I was using
<xsl:analyze-string> to match "&lt;(.+?)&gt;" and replacing it with
<alt><xsl:value-of select="regex-group(1)"/></alt>. On a subsequent pass, I
then used <xsl:analyze-string> to match "(&lt;|&gt;)(.)", deploying
<xsl:choose> to select tags (for primary or secondary stress) based on the
value of regex-group(1), and then wrap the appropriate tags around
regex-group(2). This seemed to do what I wanted.
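
For concreteness, the two passes described above can be sketched roughly as
follows. This is a minimal reconstruction under stated assumptions, not the
stylesheet actually used: the "line" parameter and its sample value, the
<entry> wrapper, the "stress" mode name, and the type="secondary" attribute
are all hypothetical and chosen only for illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Hypothetical sample line with one flat alternative and a later stress mark -->
  <xsl:param name="line" select="'MO &lt;MS-P&gt; DRUG&lt;OJ'"/>

  <xsl:template name="main" match="/">
    <!-- Pass 1: wrap each left-to-right, non-greedy "<...>" match in <alt> -->
    <xsl:variable name="pass1">
      <xsl:analyze-string select="$line" regex="&lt;(.+?)&gt;">
        <xsl:matching-substring>
          <alt><xsl:value-of select="regex-group(1)"/></alt>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <entry>
      <!-- Pass 2: walk the temporary tree from pass 1, one node at a time -->
      <xsl:apply-templates select="$pass1/node()" mode="stress"/>
    </entry>
  </xsl:template>

  <!-- Elements created in pass 1 are copied, and their content is revisited -->
  <xsl:template match="*" mode="stress">
    <xsl:copy>
      <xsl:apply-templates mode="stress"/>
    </xsl:copy>
  </xsl:template>

  <!-- Any remaining bracket followed by a character is treated as a stress mark -->
  <xsl:template match="text()" mode="stress">
    <xsl:analyze-string select="." regex="(&lt;|&gt;)(.)">
      <xsl:matching-substring>
        <xsl:choose>
          <xsl:when test="regex-group(1) = '&lt;'">
            <stress><xsl:value-of select="regex-group(2)"/></stress>
          </xsl:when>
          <xsl:otherwise>
            <!-- type="secondary" is a hypothetical way to mark secondary stress -->
            <stress type="secondary"><xsl:value-of select="regex-group(2)"/></stress>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The sample line deliberately puts the stressed word after the alternative.
With a stress mark earlier on the same line, the first-pass match would begin
at that mark, because the non-greedy pattern starts from the leftmost "<" it
can find and runs to the first ">" after it.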

The strategy failed with the nested alternative classifiers, though, where

<MS-P <P 3V>>

did just what I asked for, even though it wasn't what I wanted (sigh):

<alt>MS-P <P3V</alt>>

Note the internal "<" and the trailing ">". On the second pass, the one that
was supposed to handle stress, it got even worse:

<alt>MS-P <stress>P</stress>3V</alt>>

My next thought was that I wanted to process the input string the way I'm
doing, except look for matched pairs of angle brackets (representing an
alternative classifier) from the inside out, instead of from left to right. I
suspect I could get that with a regex like "&lt;([^&lt;&gt;]+)&gt;"
(untested, but the point is to find a "<" and a string of anything but "<" or
">" up to the first ">"), but I don't see how to use <xsl:analyze-string> for
that, since if the first pass were to yield:

<MS-P <alt>P 3V</alt>>

which is what I want, I don't know how to find the outer pair. If I do
<xsl:analyze-string> on the entire preceding value (the output of a first
pass), won't it atomize it (after all, what it's analyzing is a string),
wiping out the internal markup? And if I try to apply <xsl:analyze-string> to
the individual text nodes, the "<" and ">" aren't in the same text node.
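
The first inside-out pass can be sketched as below. This is only a minimal
illustration of the idea, assuming a regex that matches a "<", then one or
more characters that are neither "<" nor ">", then a ">". The "line"
parameter and the <entry> wrapper are hypothetical names used for the
example.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- The sample line from the top of this message, as a string -->
  <xsl:param name="line"
      select="'DRUG&lt;OJ MO &lt;MS-P &lt;P 3V&gt;&gt;'"/>

  <xsl:template name="main" match="/">
    <entry>
      <!-- Only a bracket pair with no brackets inside it can match,
           so the innermost alternative is tagged first -->
      <xsl:analyze-string select="$line" regex="&lt;([^&lt;&gt;]+)&gt;">
        <xsl:matching-substring>
          <alt><xsl:value-of select="regex-group(1)"/></alt>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Everything else, including the outer brackets, stays plain text -->
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </entry>
  </xsl:template>

</xsl:stylesheet>
```

Run on the sample line, this should leave "DRUG<OJ MO <MS-P " and the final
">" as plain text on either side of a single <alt> element. That is exactly
the state in which the outer pair is split across two text nodes.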

I realize that this may be simpler than it appears to me, and perhaps even
much simpler, but at the moment I'm having trouble even conceptualizing the
problem in a way that suggests a solution. I'd be grateful for a gentle (or
even not-so-gentle) nudge in the right direction.

Thanks,

David
djbpitt@xxxxxxxxx

