Regular expression issue

Posted: Fri Mar 30, 2012 7:25 pm
by martindholmes
Hi there,

I'm trying to find TEI <choice> tags with a particular configuration (where there is a long s in the <orig> tag and no long s in the <reg>). I have a regex like this:

<choice>\s*<orig>(\w*ſ\w*)\s*</orig>\s*<reg>\s*[^ſ]+\s*</reg>\s*</choice>

But this fails to find anything. If I replace the second instance of "ſ" with another character, such as "a", it successfully finds things:

<choice>\s*<orig>(\w*ſ\w*)\s*</orig>\s*<reg>\s*[^a]+\s*</reg>\s*</choice>

In other words, it finds a tag like this:

<choice><orig>Reſte</orig><reg>Reste</reg></choice>

where there is an "ſ" in the <orig>, and no "a" in the <reg>. However, neither this:

<choice>\s*<orig>(\w*ſ\w*)\s*</orig>\s*<reg>\s*[^ſ]+\s*</reg>\s*</choice>

nor the Unicode numeric escape version:

<choice>\s*<orig>(\w*ſ\w*)\s*</orig>\s*<reg>\s*[^\u017f]+\s*</reg>\s*</choice>

will find anything. It seems as though the "ſ" works OK when it's in a positive match, but fails when it's part of a negated character class. Am I missing something here, or is this a bug in the regex engine?

I'm using "Find in files", and the tags typically do not run over multiple lines. I have Oxygen 13.2 2012030716 running on Ubuntu Lucid 64-bit.

Re: Regular expression issue

Posted: Wed Apr 04, 2012 6:09 pm
by adrian
Hello,

Apologies for the late reply.

The problem is that ſ ("long s") is treated as equivalent to "s" when the search is case insensitive (the default). This happens because Oxygen configures the regular expression engine to be Unicode case insensitive: "s", "S" and all Unicode case variants of "s" are considered equivalent. As a result, the negated class [^ſ] also excludes "s" and "S", so it cannot match a <reg> such as "Reste".
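To illustrate why the two characters end up equivalent, here is a small sketch in Python (not Oxygen's own code, which is an assumption-free zone here: it only shows the Unicode case mappings themselves). The variable name long_s is just for illustration.

```python
# U+017F LATIN SMALL LETTER LONG S
long_s = "\u017f"

# The Unicode uppercase mapping of the long s is the ordinary capital "S",
# which is the same uppercase form that "s" has.
print(long_s.upper())                 # S
print(long_s.upper() == "s".upper())  # True

# Case folding, which is used for caseless comparison, maps it to plain "s".
print(long_s.casefold())              # s
```

Because both characters share the same uppercase form, a Unicode-aware case-insensitive comparison cannot tell them apart.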

This can be easily resolved by enabling the "Case sensitive" option in the Oxygen dialog.
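The effect can be reproduced outside Oxygen. The following is a minimal sketch using Python's re module, on the assumption that its Unicode case-insensitive matching treats ſ and "s" the same way as the engine Oxygen uses; PATTERN and SAMPLE are simply the expression and the example tag from the first post.

```python
import re

# The expression from the original post, with the long s in the negated class.
PATTERN = r"<choice>\s*<orig>(\w*ſ\w*)\s*</orig>\s*<reg>\s*[^ſ]+\s*</reg>\s*</choice>"
SAMPLE = "<choice><orig>Reſte</orig><reg>Reste</reg></choice>"

# Case insensitive: [^ſ] also rejects "s" and "S", so "Reste" cannot be matched.
insensitive = re.search(PATTERN, SAMPLE, re.IGNORECASE)
print(insensitive)  # None

# Case sensitive: [^ſ] rejects only the long s itself, so the tag is found.
sensitive = re.search(PATTERN, SAMPLE)
print(sensitive.group(1))  # Reſte
```

This corresponds to running the same search with the "Case sensitive" option disabled and then enabled.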

Regards,
Adrian