Search for letter s that excludes long s (and vice versa)

david_himself
Posts: 43
Joined: Mon Oct 01, 2018 7:29 pm

Search for letter s that excludes long s (and vice versa)

Post by david_himself »

The search engine in Find/Replace in Files seems to lump long s (Unicode U+017F) and ordinary s (U+0073) together, as if they were equivalent.

This is true for plain or regex searches. How do I search for a word like "express" that contains a double s (both of them ordinary ASCII s) while excluding from my search the spelling "expreſs", with long s followed by ordinary s? More generally, how can I specify that I want ordinary s in a search string but not long s, or vice versa?

Thanks.
David
Radu
Posts: 9220
Joined: Fri Jul 09, 2004 5:18 pm

Re: Search for letter s that excludes long s (and vice versa)

Post by Radu »

Hi David,
I just tested creating a file with these words inside:

express expreſs
I then used both Oxygen's Find/Replace dialog and "Find/Replace in Files" to search for "express". If the "Regular expression" checkbox is not checked, the search should locate only the first word as a match. So can you double-check that the "Regular expression" checkbox is unchecked in the "Find/Replace in Files" dialog? Please also uncheck the "Ignore extra whitespaces" checkbox, as it also uses the regexp search engine.
Indeed, if the "Regular expression" checkbox is enabled, both words are found. We seem to apply a more relaxed Unicode case match when the search is done with case sensitivity disabled. So if, for example, you check the "Case sensitive" checkbox, you should again match only the first word, even if regexp search is enabled.

Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com
david_himself
Posts: 43
Joined: Mon Oct 01, 2018 7:29 pm

Re: Search for letter s that excludes long s (and vice versa)

Post by david_himself »

That is very interesting. I can confirm that in both Find/Replace and Find/Replace in Files, checking either "Regular expression" or "Ignore extra whitespaces" (or both) makes the search engine mix long and short s indiscriminately in the hits, whichever one you use in the search string. Unchecking both of those boxes makes the search discriminate correctly according to the form used in the search string. That is very helpful.

In our mixed-content folder of 1,598 XML files, Find/Replace in Files with a suitable XPath filter produces the following remarkable results:

Regex  Ignore extra whitespaces  Case sensitive  Search term  Hits
-      -                         -               express      5 (only ss)
-      -                         -               expreſs      335 (only ſs, AFAIK [didn't check all])
+      -                         -               express      344 (ſs and ss)
-      +                         -               express      344 (ſs and ss)
-      -                         +               express      8 (only ss) -- why does case sensitivity add 3 more?
-      -                         +               Express      0
-      -                         +               expreſs      316 (only ſs, AFAIK)
-      -                         +               Expreſs      19 (only ſs)
+      -                         +               expreſs      316 (only ſs, AFAIK)
+      -                         +               express      5 (only ss)

(I haven't tested the remaining permutations.) Checking case sensitivity retains the discrimination if the other two boxes are unchecked. However, certain permutations of options get only 5 examples with l.c. e- and medial -ss-, while others find 8. All very baffling.

I'm also interested to know where it is recorded that long s is related to ASCII s. Is that information part of Unicode, or is it a feature of certain search engines?
best wishes
David
Radu
Posts: 9220
Joined: Fri Jul 09, 2004 5:18 pm

Re: Search for letter s that excludes long s (and vice versa)

Post by Radu »

Hi David,
We use the standard Java regular expression engine to perform the search.
In order to match case-insensitively, we enable the CASE_INSENSITIVE flag (similar to the embedded regexp flag (?i)):
https://docs.oracle.com/javase/8/docs/a ... NSENSITIVE
but as documented, this flag on its own only matches ASCII characters case-insensitively, so we also set the UNICODE_CASE flag in order to case-insensitively match non-ASCII characters as well (similar to the embedded regexp flag (?u)):
https://docs.oracle.com/javase/8/docs/a ... ICODE_CASE
and this combination of flags also seems to treat as equivalent certain characters that do not have the same code point but represent the same letter, as in your case. Right now I do not see a solution on our side for this, as setting CASE_INSENSITIVE without UNICODE_CASE would mean that find/replace would no longer properly match non-ASCII characters (like Japanese, Chinese). So for now you will need to check the "Case sensitive" checkbox, at least when searching for such strings.
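To make the flag behaviour concrete, here is a minimal standalone Java sketch. It is not Oxygen's actual code; the class name LongSDemo and the helper found are made up for illustration. It compiles the same search term with no flags, with CASE_INSENSITIVE alone, and with CASE_INSENSITIVE plus UNICODE_CASE, and tries each against the long-s spelling. The last line hints at where the equivalence presumably comes from: Unicode's own case mapping gives long s (U+017F) the uppercase form S, the same as ordinary s, so a Unicode-aware case-insensitive comparison does not tell them apart.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Oxygen.
class LongSDemo {
    // True if the regex finds a match anywhere in the text under the given flags.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        // "expreſs": long s (U+017F) followed by ordinary s
        String withLongS = "expre\u017Fs";

        int asciiOnly = Pattern.CASE_INSENSITIVE;
        int unicodeAware = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        // Case-sensitive search: long s is not treated as s.
        System.out.println(found("express", 0, withLongS));            // false
        // ASCII-only case-insensitive search: still no match.
        System.out.println(found("express", asciiOnly, withLongS));    // false
        // Unicode-aware case-insensitive search: long s matches s.
        System.out.println(found("express", unicodeAware, withLongS)); // true

        // Simple Unicode case mapping: the uppercase of long s is plain 'S'.
        System.out.println(Character.toUpperCase('\u017F') == 'S');    // true
    }
}
```

If a case-insensitive regex search is needed without this equivalence, one option that should work is to keep "Case sensitive" checked and spell out the case alternatives in the pattern itself, for example [Ee]xpress.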
Regards,
Radu
david_himself
Posts: 43
Joined: Mon Oct 01, 2018 7:29 pm

Re: Search for letter s that excludes long s (and vice versa)

Post by david_himself »

Many thanks!
best
David