WebHelp Responsive search: How do "Stop Words" work?

Anonymous1

WebHelp Responsive search: How do "Stop Words" work?

Post by Anonymous1 »

Hello,

First of all, thank you for the new search capabilities in Oxygen 19.

We are currently translating the new search strings into our various languages. Two strings mention the "stop words", such as "of", "the", and "by".

How does this work in other languages? I can see that you have translated them into Spanish, for example. How should we proceed if we would like to add Russian, for example? Is there a way to add or remove stop words?

Thanks,

Benjamin
Anonymous1

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by Anonymous1 »

Correction: I've just realized that the Spanish translation was done by a colleague of mine and not by you. So here is the more general question: How should we deal with translating the strings in the WebHelp search?
radu_pisoi
Posts: 403
Joined: Thu Aug 21, 2003 11:36 am
Location: Craiova

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by radu_pisoi »

Hi,

The procedure for localizing the WebHelp output is described in our user manual in the Localizing the Interface of WebHelp Output (for DITA Map Transformations) topic.
Anonymous1 wrote: We are currently translating the new search strings in our various languages. Two strings mention the "stop words", such as "of", "the", and "by".
Do you need the context in which these strings are used? If so, could you tell us which strings you need more information about?
Radu Pisoi
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
Anonymous1

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by Anonymous1 »

Thanks for your answer.

We know how to localize the WebHelp output; I mean something different here.

The search treats some words as so-called "stop words", which means they are ignored when searching for terms. Two strings mention stop words:

Code: Select all

No results were found because the search query only contains <span>stop words</span> that are excluded by the search engine.

Code: Select all

Stop words are very common words or adjectives that hinder search efforts. Words such as: &apos;of&apos;, &apos;the&apos;, &apos;by&apos;, etc.
We must translate those strings into our target languages (Spanish, French, Japanese, Russian, etc.).

The question now is: What do we do with the stop words (of, the, by, ...)? Just because we translate them doesn't mean that the search actually ignores them in other languages.

How does the search know which words are stop words? And can we add stop words for other languages as well?
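
To make the concept concrete, here is a minimal sketch of how a search front end could drop stop words from a query and detect the situation that triggers the first message above. This is not the WebHelp implementation; the function names `removeStopWords` and `onlyStopWords` and the three-word sample list are hypothetical.

```javascript
// Hypothetical sample list; the real list is produced by the WebHelp indexer.
const sampleStopWords = ["of", "the", "by"];

// Lowercase the query, split it on whitespace, and drop every stop word.
function removeStopWords(query, stopWords) {
  const stopSet = new Set(stopWords.map((w) => w.toLowerCase()));
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 0 && !stopSet.has(term));
}

// True when the query contained terms, but every one of them was a stop word.
function onlyStopWords(query, stopWords) {
  const hasTerms = query.trim().length > 0;
  return hasTerms && removeStopWords(query, stopWords).length === 0;
}

console.log(removeStopWords("the history of search", sampleStopWords)); // history, search
console.log(onlyStopWords("of the", sampleStopWords)); // true
```

The sketch also shows why translating the example words in the message is not enough: the filter only ignores what is actually in the list it is given.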
radu_pisoi
Posts: 403
Joined: Thu Aug 21, 2003 11:36 am
Location: Craiova

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by radu_pisoi »

Hi,

The stop words are computed dynamically, depending on the language you chose when publishing your documentation. They are computed by the search indexer and written to the out/webhelp-responsive/oxygen-webhelp/search/index-1.js file:

Code: Select all

stopWords = new Array();
stopWords[0]= "but";
stopWords[1]= "be";
stopWords[2]= "with";
stopWords[3]= "such";
....
So, to find out which stop words apply to a certain language, you need to inspect the generated index-1.js file.

There is no parameter to control the stop words.
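
If you need the list in a usable form rather than reading the file by eye, a small script can pull the words out of the generated assignments. A minimal sketch, assuming the file keeps the `stopWords[n]= "word";` shape shown above; the function name `parseStopWordAssignments` is made up for this example, and the sample text is just the excerpt from this post.

```javascript
// Extract the words from assignments such as: stopWords[0]= "but";
function parseStopWordAssignments(source) {
  const pattern = /stopWords\[\d+\]\s*=\s*"([^"]*)"\s*;/g;
  const words = [];
  for (const match of source.matchAll(pattern)) {
    words.push(match[1]);
  }
  return words;
}

// Sample text taken from the excerpt above (not a complete index-1.js file).
const indexSample = `
stopWords = new Array();
stopWords[0]= "but";
stopWords[1]= "be";
stopWords[2]= "with";
stopWords[3]= "such";
`;

console.log(parseStopWordAssignments(indexSample).join(", ")); // but, be, with, such
```

In a real script you would pass in the text of index-1.js (for example, read with Node's `fs.readFileSync`) instead of the sample string.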
Radu Pisoi
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
Anonymous1

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by Anonymous1 »

Thank you, that helps a lot.
Gertone
Posts: 20
Joined: Mon Sep 17, 2007 10:02 am
Location: Flanders

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by Gertone »

Hi,

I do realise I am reviving a pretty old thread...

In v23 I looked for index-1.js but realized that the array construction has moved to
...\oxygen-webhelp\app\search\index\stopwords.js

Code: Select all

define(function() {
// Auto generated list of analyzer stop words that must be ignored by search.
return ["but","be","with","such","then","for","no","will","not","are","and","their","if","this","on","into","a","or","there","in","that","they","was","is","it","an","the","as","at","these","by","to","of"];
});
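
To read that list outside the browser (for example, in a build script that compares the output for several languages), one option is to evaluate the file with a stand-in `define` function. A minimal sketch, assuming the file keeps the exact shape shown above: a single `define(function() { return [...]; })` call with no dependencies. The helper name `readAmdStopWords` is hypothetical, and the sample is a shortened copy of the file above.

```javascript
function readAmdStopWords(source) {
  let words = null;
  // Stand-in for the AMD loader: call the factory and capture its return value.
  const define = (factory) => {
    words = factory();
  };
  // Run the file's text with our `define` in scope.
  // Only do this with files generated by your own build, since the text is executed.
  new Function("define", source)(define);
  if (!Array.isArray(words)) {
    throw new Error("Expected define() to return an array of stop words");
  }
  return words;
}

// Shortened copy of the generated module shown above.
const amdSample = `define(function() {
// Auto generated list of analyzer stop words that must be ignored by search.
return ["but","be","with","such"];
});`;

console.log(readAmdStopWords(amdSample).length); // 4
```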
Does that imply that we can now influence the stop words?

I guess I could swap that file with a project- or language-dependent function, either manually or through a plugin change,
but doing it from the configuration of the customization would be my preferred path.

Thanks for any other suggestions

Geert Bormans
radu_pisoi
Posts: 403
Joined: Thu Aug 21, 2003 11:36 am
Location: Craiova

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by radu_pisoi »

Hi,

Starting with version 23, you can customize the stop words list by using the following two parameters: webhelp.search.stop.words.exclude and webhelp.search.stop.words.include. They allow you to remove words from the default stop words list or add your own words to it.

Please see the WebHelp Responsive Transformation Parameters topic in the WebHelp documentation for more details.
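
As a rough mental model of what the two parameters do, you can think of the final list as the language's default list, plus the `include` words, minus the `exclude` words. The sketch below is an assumption based on the parameter names and the description above, not on the WebHelp source code; the function names, the order of operations, and the comma-separated value format (as used elsewhere in this thread) are illustrative.

```javascript
// Parse a comma-separated parameter value into trimmed, lowercased words.
function parseWordList(value) {
  return value
    .split(",")
    .map((w) => w.trim().toLowerCase())
    .filter((w) => w.length > 0);
}

// Assumed model: (defaults + include) minus exclude.
function effectiveStopWords(defaults, includeValue, excludeValue) {
  const result = new Set(defaults.map((w) => w.toLowerCase()));
  for (const w of parseWordList(includeValue)) result.add(w);
  for (const w of parseWordList(excludeValue)) result.delete(w);
  return [...result];
}

// Hypothetical default list and parameter values.
const defaultWords = ["but", "be", "with", "by", "into"];
console.log(effectiveStopWords(defaultWords, "etc,via", "by,into").join(", "));
// but, be, with, etc, via
```

With this model, a word listed in `exclude` becomes searchable again, and a word listed in `include` is ignored by the search.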
Radu Pisoi
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
Gertone
Posts: 20
Joined: Mon Sep 17, 2007 10:02 am
Location: Flanders

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by Gertone »

Hi Radu,

Thanks for pointing me to the right place in the manual
(and thank you, Oxygen, for adding that functionality).

I assume this cannot be made language-dependent, other than by adding all languages to one parameter?
Anyhow, the functionality is already extremely useful as it is.

Thanks,

Geert
radu_pisoi
Posts: 403
Joined: Thu Aug 21, 2003 11:36 am
Location: Craiova

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by radu_pisoi »

Hi,
Gertone wrote: Mon Jan 11, 2021 4:27 pm I assume this can not be made language dependent other than add all languages in one parameter?
No, you should set the exclude/include stop words parameters according to the language of the output you are currently publishing.
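
One way to organize this is to keep the word lists per language in your build script and pick the pair that matches the language being published. A minimal sketch: the language codes, the word lists, and the helper `stopWordParamsFor` are hypothetical placeholders; only the two parameter names come from this thread.

```javascript
// Hypothetical per-language settings; replace the words with lists reviewed by native speakers.
const stopWordSettings = {
  en: { include: ["etc"], exclude: ["by", "into"] },
  ru: { include: ["иногда", "очень"], exclude: [] },
};

// Build the parameter name/value pairs for one language.
// An empty string means "nothing to pass"; omit that parameter from the transformation.
function stopWordParamsFor(lang) {
  const settings = stopWordSettings[lang] || { include: [], exclude: [] };
  return {
    "webhelp.search.stop.words.include": settings.include.join(","),
    "webhelp.search.stop.words.exclude": settings.exclude.join(","),
  };
}

console.log(stopWordParamsFor("en"));
console.log(stopWordParamsFor("ru"));
```

The resulting values can then be set on the transformation scenario or passed to the build for that language only, so each output gets its own list.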
Radu Pisoi
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
mmgHinchey
Posts: 4
Joined: Fri Feb 04, 2022 6:09 pm

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by mmgHinchey »

I've noticed that some languages (for example, Simplified Chinese, Ukrainian, and Korean) do not produce language-specific stop words. Is there a list somewhere of which languages generate a language-specific stopwords.js and which generate a stopwords.js based on English?
Thank you
galanohan
Posts: 115
Joined: Mon Jul 10, 2023 11:49 am

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by galanohan »

I'm a native Chinese speaker, and I know a little Korean because of the influence of ancient Chinese on it. In my opinion, what should go into the "webhelp.search.stop.words.exclude" parameter, like what I specified for the English WebHelp output in the .opt file:
<parameter name="webhelp.search.stop.words.exclude"
value="in,at,with,and,or,from,into,not,by,if,for,as,a,an,is,no,not,of,on"/>

actually depends on how you define an efficient search and on the context in which you maintain a whitelist/blacklist of stop words.

1. Context. In English, many words such as "is", "for", "and", and "at" appear in almost any sentence. Normally, we don't want users to run a long and meaningless keyword search that returns thousands of results that have nothing to do with what they were actually looking for. In certain contexts, however, for example if the product is about the SQL language or an SQL-like database product, keeping words like "into", "by", "and", "at", "in", and "as" in the stop words list can block searches for SQL keywords or statements that contain them. So it's better to exclude these words from the stop words list, or at least to exclude specific keywords such as "like", "group by", and "context by".

2. Language. In Chinese, Korean, Japanese, and similar languages, there are always some words that do not mean anything specific; if they do mean something, they function as a formal or polite sentence ending, such as 습니다 ("smida") at the end of a descriptive sentence, especially in TV news or in newspapers. Chinese, especially ancient Chinese, has many similar sentence-final words such as "也", "矣", and "哉"; these modal particles don't mean anything. Modern Chinese has some words, mostly adverbs, such as "有时" (sometimes/иногда) and "非常" (very/очень), and sometimes other common words like "什么" (what/что) and "这个" (this/это). These words usually don't bring readers meaningful search results, so they should be included in the stop words list.

So, instead of setting an absolute rule for all languages, consulting native speakers and asking for their opinions when putting together a stop words list might be a better practice.
cosmin_andrei
Posts: 138
Joined: Mon Jun 12, 2017 10:50 am

Re: WebHelp Responsive search: How do "Stop Words" work?

Post by cosmin_andrei »

Hi galanohan,

Note that there is no hardcoded stop words list in the Oxygen WebHelp code.
For content indexing, we use the Apache Lucene library, and the stop words list for each individual language is obtained from Lucene.
Regards,
Cosmin
--
Cosmin Andrei
oXygen XML Editor and Author Support