WebHelp Responsive search: How do "Stop Words" work?

Posted: Wed May 03, 2017 5:34 pm
by Anonymous1
Hello,

First of all, thank you for the new search capabilities in Oxygen 19.

We are currently translating the new search strings in our various languages. Two strings mention the "stop words", such as "of", "the", and "by".

How does this work in other languages? I can see that you have translated them into Spanish, for example. How should we proceed if we would like to add Russian, for example? Is there a way to add or remove stop words?

Thanks,

Benjamin

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Thu May 04, 2017 12:35 pm
by Anonymous1
Correction: I've just realized that the Spanish translation was done by a colleague of mine and not by you. So the more general question: How should we deal with translating the strings in the WebHelp search?

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Thu May 04, 2017 2:37 pm
by radu_pisoi
Hi,

The procedure for localizing the WebHelp output is described in our user manual in the Localizing the Interface of WebHelp Output (for DITA Map Transformations) topic.
Anonymous1 wrote: We are currently translating the new search strings in our various languages. Two strings mention the "stop words", such as "of", "the", and "by".
Do you need the context where these strings are used? If so, could you tell us which strings you need additional information about?

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Thu May 04, 2017 4:08 pm
by Anonymous1
Thanks for your answer.

We know how to localize the WebHelp output; I mean something different here.

The search treats some words as so-called "stop words". This means they are ignored when searching for terms. There are two strings that mention stop words:

No results were found because the search query only contains <span>stop words</span> that are excluded by the search engine.

Stop words are very common words or adjectives that hinder search efforts. Words such as: &apos;of&apos;, &apos;the&apos;, &apos;by&apos;, etc.
We must translate those strings into our target languages (Spanish, French, Japanese, Russian, etc.).

The question now is: What do we do with the stop words (of, the, by, ...)? Just because we translate them doesn't mean that the search actually ignores them in other languages.

How does the search know which words are stop words? And can we add stop words for other languages as well?

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Thu May 04, 2017 10:05 pm
by radu_pisoi
Hi,

The stop words are computed dynamically, depending on the language you have chosen when you publish your documentation. They are computed by the search indexer and written to the out/webhelp-responsive/oxygen-webhelp/search/index-1.js file:

stopWords = new Array();
stopWords[0]= "but";
stopWords[1]= "be";
stopWords[2]= "with";
stopWords[3]= "such";
....
So, if you want to know which stop words are used for a certain language, you need to inspect the index-1.js file.

There is no parameter to control the stop words.
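
To illustrate what such a list means at query time, here is a minimal sketch. It is not the actual WebHelp search code: the short `stopWords` sample and the helper `filterQuery` are hypothetical and only show the general idea of dropping stop words from a query. A query that consists only of stop words ends up empty, which is the situation the "No results were found" string describes.

```javascript
// Hypothetical sample; the real list is generated into index-1.js.
const stopWords = ["but", "be", "with", "such"];

// Illustrative helper (not part of WebHelp): removes stop words from a query.
function filterQuery(query, stopWordList) {
  const stopSet = new Set(stopWordList.map((w) => w.toLowerCase()));
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 0 && !stopSet.has(term));
}

// Only "topics" and "images" survive the filter.
console.log(filterQuery("such topics with images", stopWords));
// Nothing survives: the query contains only stop words.
console.log(filterQuery("but be with", stopWords));
```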

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Fri May 05, 2017 11:33 am
by Anonymous1
Thank you, that helps a lot.

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Sun Jan 10, 2021 4:14 pm
by Gertone
Hi,

I do realise I am reviving a pretty old thread...

In v23 I looked for the index-1.js
but realized that the array construction has moved to
...\oxygen-webhelp\app\search\index\stopwords.js

define(function() {
// Auto generated list of analyzer stop words that must be ignored by search.
return ["but","be","with","such","then","for","no","will","not","are","and","their","if","this","on","into","a","or","there","in","that","they","was","is","it","an","the","as","at","these","by","to","of"];
});
Does that imply that we can now influence the stop words?

I guess I could swap that file with a project/language dependent function, either manually or through a plugin change,
but doing it from the configuration of the customization would be my preferred path.

Thanks for other suggestions

Geert Bormans

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Mon Jan 11, 2021 11:18 am
by radu_pisoi
Hi,

Starting with version 23, you can customize the stop words list by using the following two parameters: webhelp.search.stop.words.exclude and webhelp.search.stop.words.include. They allow you to exclude/include custom stop words.

Please see the WebHelp Responsive Transformation Parameters topic in WebHelp documentation for more details.
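
A rough sketch of the intended effect, under stated assumptions: `exclude` is assumed to remove words from the default list (so they become searchable again), and `include` is assumed to add words to it. The helper `applyStopWordParams` is hypothetical; the real processing happens inside the search indexer, and the parameters topic remains the authority on the exact behavior.

```javascript
// Hypothetical helper that mimics the assumed effect of the two parameters.
function applyStopWordParams(defaultList, excludeParam, includeParam) {
  // Parameter values are comma-separated word lists.
  const parse = (value) =>
    (value || "")
      .split(",")
      .map((w) => w.trim().toLowerCase())
      .filter((w) => w.length > 0);

  const result = new Set(defaultList);
  for (const word of parse(excludeParam)) result.delete(word); // no longer a stop word
  // Assumption: include is applied after exclude, so it wins on conflicts.
  for (const word of parse(includeParam)) result.add(word); // new stop word
  return [...result];
}

// Sample default list, shortened for illustration.
const defaults = ["but", "be", "with", "by", "of"];
console.log(applyStopWordParams(defaults, "by,with", "etc"));
```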

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Mon Jan 11, 2021 4:27 pm
by Gertone
Hi Radu,

Thanks for pointing me to the right place in the manual
(and thank you Oxygen for adding that functionality)

I assume this can not be made language dependent other than add all languages in one parameter?
Anyhow, the functionality is extremely useful as it is already

Thanks,

Geert

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Wed Jan 13, 2021 11:43 am
by radu_pisoi
Hi,
Gertone wrote: Mon Jan 11, 2021 4:27 pm I assume this can not be made language dependent other than add all languages in one parameter?
No. You should set the exclude and include stop words parameters according to the language you are currently publishing.
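
One way to keep per-language values, sketched under assumptions: a hypothetical lookup table keyed by language code, from which a build script picks the two parameter values for the current language. The table, the word lists, and the helper `stopWordArgs` are illustrative only; the `-D` prefix is the usual way of passing a parameter on the DITA-OT command line, so adapt the output to however your transformation receives its parameters.

```javascript
// Hypothetical per-language settings; the word lists are placeholders.
const stopWordSettings = {
  en: { exclude: ["by", "into", "as"], include: [] },
  zh: { exclude: [], include: ["什么", "这个"] },
};

// Builds the parameter arguments for one language.
// Languages without an entry get no stop word parameters at all.
function stopWordArgs(lang, settings) {
  const entry = settings[lang] || { exclude: [], include: [] };
  const args = [];
  if (entry.exclude.length > 0) {
    args.push(`-Dwebhelp.search.stop.words.exclude=${entry.exclude.join(",")}`);
  }
  if (entry.include.length > 0) {
    args.push(`-Dwebhelp.search.stop.words.include=${entry.include.join(",")}`);
  }
  return args;
}

console.log(stopWordArgs("en", stopWordSettings));
console.log(stopWordArgs("zh", stopWordSettings));
```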

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Wed Sep 06, 2023 11:30 pm
by mmgHinchey
I've noticed that some languages do not produce language-specific stop words (for example, Simplified Chinese, Ukrainian, and Korean). Is there a list somewhere of which languages generate a language-specific stopwords.js, and which generate a stopwords.js based on English?
Thank you
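
In the absence of such a list, one could check each generated file directly. The following is a minimal sketch, not an Oxygen tool: it assumes stopwords.js keeps the `define(function() { return [...]; })` shape shown earlier in this thread and that the English list is still the one posted for v23. Both helper names are hypothetical.

```javascript
// English list as posted earlier in this thread (v23 output).
const englishStopWords = ["but","be","with","such","then","for","no","will","not","are","and","their","if","this","on","into","a","or","there","in","that","they","was","is","it","an","the","as","at","these","by","to","of"];

// Evaluates the module source with a stand-in for define() and returns the
// array it produces. This executes the file's code, so only use it on files
// you generated yourself.
function loadStopWords(moduleSource) {
  let exported = [];
  const define = (factory) => { exported = factory(); };
  new Function("define", moduleSource)(define);
  return exported;
}

// True when the list contains exactly the English words, in any order.
function looksLikeEnglishFallback(list) {
  const english = new Set(englishStopWords);
  return list.length === english.size && list.every((w) => english.has(w));
}

const sample = 'define(function() { return ["и","в","не"]; });';
console.log(looksLikeEnglishFallback(loadStopWords(sample)));
```

In Node.js, the module source could be read with `fs.readFileSync(path, "utf8")` from each language's output folder before passing it to `loadStopWords`.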

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Mon Sep 11, 2023 3:25 am
by galanohan
I'm a native Chinese speaker, and I know a little Korean because of the influence of ancient Chinese. In my opinion, what should be included or excluded in the parameter "webhelp.search.stop.words.exclude", such as what I specified for the English WebHelp output in the .opt file:
<parameter name="webhelp.search.stop.words.exclude"
           value="in,at,with,and,or,from,into,not,by,if,for,as,a,an,is,no,of,on"/>

actually depends on how you define the efficiency of a search and on the context in which you maintain a whitelist or blacklist of stop words.

1. Context. In English, many words such as "is", "for", "and", and "at" appear in almost any sentence. Normally, we don't want users to run a long, meaningless keyword search that returns thousands of results unrelated to what they actually intended. In certain contexts, however, for example if the product is about SQL or an SQL-like database, keeping words like "into", "by", "and", "at", "in", and "as" in the stop words list can block searches for SQL keywords or statements that contain them. So it's better to exclude these words from the stop words list, or at least to exclude specific keywords like "like", "group by", "context by", etc.

2. Language. In Chinese, Korean, Japanese, etc., there are always some words that carry no specific meaning, or that only serve as a formal or polite sentence ending, such as 습니다 ("smida") at the end of a declarative sentence, especially in TV news or in newspapers. Chinese, especially ancient Chinese, has many similar sentence-final words like "也", "矣", and "哉"; these modal particles don't mean anything. Modern Chinese has some words, mostly adverbs, such as "有时" (sometimes/иногда) and "非常" (very/очень), and sometimes other words like "什么" (what/что) and "这个" (this/это). These words usually don't give readers meaningful search results, so they should be added to the stop words list.

So, instead of setting an absolute rule for various languages, consulting native speakers and asking for their opinions to build a stop words list might be a better practice.

Re: WebHelp Responsive search: How do "Stop Words" work?

Posted: Mon Sep 11, 2023 11:50 am
by cosmin_andrei
Hi galanohan,

Note that there is no hardcoded stop words list in the Oxygen WebHelp code.
For content indexing, we use the Apache Lucene library, and the stop words list for each individual language is obtained from Lucene.