How to customize indexing and search methods

caoda
Posts: 2
Joined: Tue May 09, 2023 12:28 pm

Post by caoda »

Our documentation is in Chinese, and the default segmentation and search methods do not handle Chinese well. We therefore want to customize indexing and search to improve results for Chinese documents. I noticed that you document a way to integrate a custom search engine: https://www.oxygenxml.com/doc/versions/ ... ngine.html
However, a custom search engine alone does not solve the problem, because I cannot yet customize the generated index file (index.js), and the default index file cannot give reasonable search results because its segmentation logic is too simple. I tried changing the default.language parameter to zh, but the segmentation is still crude (every two characters are treated as one word), which does not meet our requirements. So I think only a custom index file combined with a custom search engine can solve this, but I could not find the relevant configuration. Perhaps a plugin developed with your SDK could hook into index generation, so that we define the index file ourselves. If this is feasible, could you give us an overview of how to hook into the index generation process? I have reviewed the SDK documentation but could not find the relevant entry point.
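
To make the problem concrete, here is a minimal Python sketch of two-character segmentation. It assumes overlapping character pairs; the thread does not say exactly how the WebHelp indexer pairs characters, and `bigrams` is a hypothetical helper, not part of WebHelp.

```python
def bigrams(text: str) -> list[str]:
    """Split text into overlapping two-character tokens (assumed behaviour)."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Nothing to pair up: return the single character, or nothing at all.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(bigrams("开始之前"))  # ['开始', '始之', '之前']
```

The middle token 始之 is not a word, which is the kind of noise that makes the resulting index imprecise.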
beniamin_savu
Posts: 31
Joined: Fri Jan 22, 2021 11:05 am

Re: How to customize indexing and search methods

Post by beniamin_savu »

Hi,

Unfortunately, we do not have any extension point for customizing the generated index file in WebHelp, and it cannot be done through the SDK either.

At the moment, the WebHelp Responsive transformation offers three search options:
  • The Default Search Engine
    By default, the WebHelp Responsive transformation includes a search feature with a rating mechanism that computes a score for every result that matches the search criteria. When the language is set to Chinese, this search engine has a limitation when searching for multiple words that appear in a string without a space separator: you need to add a space between the words, otherwise WebHelp will not find any results. Chinese uses a specialized character for space separators, but the current WebHelp implementation cannot detect such characters. So to search for 开始之前, you have to enter 开始 之前 (notice the space between the second and third characters) in the search field.
  • Using Oxygen Feedback search functionality
    The latest version of Oxygen Feedback comes with a search functionality that has multi-language support for the WebHelp Responsive output. We suggest signing up for an Oxygen Feedback trial and trying its search functionality. Please let us know whether the search behaves better, or if you encounter any problems when using Oxygen Feedback for search. Please also note that the Oxygen Feedback search functionality requires the output to be generated with Oxygen WebHelp version 25 or later. You can also watch our webinar on how to enable the search functionality in Oxygen Feedback: https://www.oxygenxml.com/events/2023/w ... k_3_0.html
  • Implementing a custom search engine (A more technical solution)
    It is possible to integrate a custom search engine into your WebHelp Responsive output. This gives you full control over how the search engine behaves: you can implement your own mechanism for indexing the HTML files and for processing the search queries. With this method, the default search engine integrated in WebHelp is disabled. You can find more details on how to implement a custom search engine here: https://www.oxygenxml.com/doc/versions/ ... ngine.html
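
As a rough illustration of what "your own mechanism of indexing and processing queries" can mean, here is a minimal Python sketch of an inverted index with a pluggable tokenizer. The names `build_index` and `search` and the sample file names are hypothetical; this shows the general concept only and is not the WebHelp custom search engine integration itself, which is described at the link above.

```python
from collections import defaultdict
from typing import Callable

Tokenizer = Callable[[str], list[str]]


def build_index(docs: dict[str, str], tokenize: Tokenizer) -> dict[str, set[str]]:
    """Map every token to the set of documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for path, text in docs.items():
        for token in tokenize(text):
            index[token].add(path)
    return dict(index)


def search(index: dict[str, set[str]], query: str, tokenize: Tokenizer) -> set[str]:
    """Return the documents that contain all tokens of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


# Hypothetical, pre-segmented sample content; str.split tokenizes on whitespace.
docs = {"a.html": "开始 之前 请 阅读", "b.html": "安装 之前"}
index = build_index(docs, str.split)
print(search(index, "开始 之前", str.split))  # {'a.html'}
```

Because the tokenizer is a parameter, the same structure works with a real Chinese word segmenter in place of `str.split`, as long as documents and queries are tokenized the same way.
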
Best regards,
Beniamin Savu
Oxygen WebHelp Team
http://www.oxygenxml.com
caoda
Posts: 2
Joined: Tue May 09, 2023 12:28 pm

Re: How to customize indexing and search methods

Post by caoda »

Regarding the custom search engine solution you mentioned, when and how should we build our own index files? Do we still need to parse the original document files ourselves (parsing rich text files is not simple)? And with this solution, is it true that we cannot highlight keywords after a user clicks a search result and opens the document?

Also, I noticed that you have special support for Japanese and that you use https://github.com/takuyaa/kuromoji.js/ to segment Japanese words. I think the main reason for the current poor search results is that the Chinese word segmentation logic is not good, so you would only need to use a Chinese word segmentation tool with semantic analysis to solve this problem, just as you solved the Japanese word segmentation problem. It is much simpler for you to solve this proactively than to hand it over to users. Here is the JS version of a Chinese word segmentation tool that I found: https://github.com/pulipulichen/jieba-js. Given the precedent of special support for Japanese, I think similar processing for Chinese would be very simple. I have spent a lot of time over the past few days browsing forum questions and your documentation. The issue of Chinese search has been raised since at least 2017. Why has it not been taken seriously?
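
For readers unfamiliar with dictionary-based segmentation, here is a minimal Python sketch using forward maximum matching. This is a simplified stand-in chosen for illustration, not the algorithm used by kuromoji.js or jieba-js; the `segment` helper and the tiny vocabulary are made up.

```python
def segment(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Greedily take the longest vocabulary word at each position."""
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


vocab = {"开始", "之前", "搜索", "引擎"}  # hypothetical sample dictionary
print(segment("开始之前", vocab))  # ['开始', '之前']
```

Unlike two-character pairing, this produces only tokens that are in the dictionary (plus single characters for unknown text), so the query 开始之前 can be matched without the user typing a space.
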
beniamin_savu
Posts: 31
Joined: Fri Jan 22, 2021 11:05 am

Re: How to customize indexing and search methods

Post by beniamin_savu »

Hi,

The custom search engine solution gives you full control over the indexing and parsing of the HTML files. It is indeed a very technical solution: you have to either implement the parsing of the HTML files yourself or use a third-party search engine that does it for you.
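
As a rough idea of what parsing the HTML output yourself could involve, here is a minimal Python sketch that extracts the visible text from a page with the standard library's `html.parser`. The `TextExtractor` class and `extract_text` function are hypothetical names, and the sample markup is made up; real WebHelp pages would likely need extra filtering (navigation, headers, footers).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the text nodes of an HTML page joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<html><body><h1>开始之前</h1><p>请先 <b>安装</b> 软件。</p><script>var x = 1;</script></body></html>"
print(extract_text(page))  # 开始之前 请先 安装 软件。
```

The extracted text would then be passed to a word segmenter before indexing. Note that joining text nodes with spaces also inserts a space around inline markup such as `<b>`, which may or may not be desirable for Chinese text.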

Integrating a third-party library into our products is a complex and difficult process. The library needs to pass multiple tests and also needs to be analyzed and approved by our legal and security departments.

However, we strongly recommend using Oxygen Feedback, which is a much easier alternative. You can sign up for a one-month trial to test it. Oxygen Feedback comes in two editions: Cloud and Enterprise. The Cloud edition can be accessed here: https://feedback.oxygenxml.com/
The Cloud edition offers a fast start-up process and requires no software installation.
Have you managed to test it? Does it provide better results for your use cases?

Oxygen Feedback's search functionality is a server-side process, which can provide faster and more relevant search results to the user and is not limited by the resources available on the user's device. The default search engine bundled with WebHelp Responsive is a client-side process: it requires downloading the entire search index to the user's device, it is limited by the resources available on that device, and it can become slow and less efficient.

To get started with the Oxygen Feedback search functionality, we suggest watching the webinar "How to Enable Content Indexing and Search in Oxygen Feedback".

Best regards,
Beniamin Savu
Oxygen WebHelp Team
http://www.oxygenxml.com