[oXygen-user] Feature request: Improvement of Japanese search for WebHelp

Wed Apr 15 06:19:46 CDT 2015

Sorin,

That will be of great help!
When you have a test or experimental build in the future, let me know and I'll be happy to test it, though I'm no expert.

In the meantime, I looked at the output of the following Lucene/Kuromoji code with Japanese input.

        // Load the custom entries, then build the analyzer in SEARCH mode with the default stop words and stop tags.
        UserDictionary userDic = new UserDictionary( new FileReader( new File( "userdic.txt" ) ) );
        Analyzer analyzer = new JapaneseAnalyzer( userDic, JapaneseTokenizer.Mode.SEARCH,
                JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags() );

These default parameters work fairly well even without UserDictionary().
However, a user dictionary at build time would be a strong plus, since the current client-side JavaScript does little to support partial matching and can therefore miss critical keywords.
So, when you design this feature, please also consider making the user dictionary file path configurable via a WebHelp transformation parameter.
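
For reference, my understanding of the Kuromoji user dictionary file is one comma-separated entry per line: surface form, space-separated segmentation, space-separated readings, and a part-of-speech tag, with "#" starting a comment. An entry looks like "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞". The Java sketch below is only a simplified stand-alone reader for that layout, useful for sanity-checking a file before a build. The class and method names are hypothetical, it is not Lucene's own parser, and it ignores quoted fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a simplified reader for the four-column user dictionary
// layout (surface form, segmentation, readings, part-of-speech tag).
// Hypothetical helper, not part of Lucene; quoted CSV fields are not handled.
class UserDictionaryEntries {

    static final class Entry {
        final String surface;
        final List<String> segments;
        final List<String> readings;
        final String posTag;

        Entry(String surface, List<String> segments, List<String> readings, String posTag) {
            this.surface = surface;
            this.segments = segments;
            this.readings = readings;
            this.posTag = posTag;
        }
    }

    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            // Drop anything after '#', then skip lines that are empty.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split(",");
            if (fields.length != 4) {
                throw new IllegalArgumentException("Expected 4 fields: " + raw);
            }
            List<String> segments = Arrays.asList(fields[1].trim().split("\\s+"));
            List<String> readings = Arrays.asList(fields[2].trim().split("\\s+"));
            // Sanity check chosen for this sketch: one reading per segment.
            if (segments.size() != readings.size()) {
                throw new IllegalArgumentException("Segment/reading count mismatch: " + raw);
            }
            entries.add(new Entry(fields[0].trim(), segments, readings, fields[3].trim()));
        }
        return entries;
    }
}
```

The sample values in any test of this reader can be plain ASCII placeholders, since only the column layout matters here.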

Thanks,
T. Hatanaka

________________________________________
From: Support Oxygen XML Editor (Sorin Ristache)
Sent: Wednesday, April 15, 2015 00:17
To: T. Hatanaka
Subject: Re: [oXygen-user] Feature request: Improvement of Japanese search for WebHelp

Hello,

On 4/12/2015 5:26 AM, T. Hatanaka wrote:
> That being said, it would still help a lot to integrate sophisticated analyzers, even if only at build time. Run-time analyzers are less necessary, I guess. When Japanese users search the Web, they usually perform a kind of tokenization and normalization themselves.
> That is, they do not usually enter "BROWNFOXJUMPS" in the search text box. In most cases we can expect them to type "BROWN FOX JUMP".
> In fact, "Please enter keywords separated by spaces" has long been a common instruction on Japanese search UIs. People have gotten used to it.
> So I guess that if the index were created by a sophisticated analyzer at build time with custom dictionaries, we could improve the search experience a lot with relatively minor tweaks to the run-time JavaScript.

Thank you for letting us know. In a future version we will integrate the
Kuromoji analyzer in our Apache Lucene customization that runs on the
generated WebHelp pages for building the WebHelp search index. This
index will offer relevant search results in the WebHelp pages only for
Japanese search terms entered in the browser that are properly separated
with space characters.
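
A rough illustration of that division of labor, in plain Java with no Lucene dependency: the index is assumed to hold tokens already produced by an analyzer at build time, and the run-time side only splits the query on spaces (ASCII or the ideographic space U+3000) and intersects the matches. All class, method, and page names are hypothetical, and the English tokens stand in for analyzed Japanese terms; this is a sketch of the idea, not the WebHelp implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; class, method, and page names are hypothetical.
class SpaceSeparatedSearch {

    // Build time: map each token (already produced by an analyzer) to the pages containing it.
    static Map<String, Set<String>> buildIndex(Map<String, List<String>> tokensByPage) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, List<String>> page : tokensByPage.entrySet()) {
            for (String token : page.getValue()) {
                index.computeIfAbsent(token.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                     .add(page.getKey());
            }
        }
        return index;
    }

    // Run time: split the query on ASCII or ideographic (U+3000) spaces and
    // return the pages that contain every term.
    static Set<String> search(Map<String, Set<String>> index, String query) {
        Set<String> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("[\\s\\u3000]+")) {
            if (term.isEmpty()) {
                continue; // produced by leading separators or an empty query
            }
            Set<String> pages = index.get(term);
            if (pages == null) {
                return new TreeSet<>(); // one unknown term means no page matches all terms
            }
            if (result == null) {
                result = new TreeSet<>(pages);
            } else {
                result.retainAll(pages);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pages = new HashMap<>();
        pages.put("a.html", Arrays.asList("brown", "fox", "jump"));
        pages.put("b.html", Arrays.asList("brown", "dog"));
        Map<String, Set<String>> index = buildIndex(pages);

        System.out.println(search(index, "BROWN FOX JUMP"));  // [a.html]
        System.out.println(search(index, "brown"));           // [a.html, b.html]
        System.out.println(search(index, "BROWNFOXJUMPS"));   // []
    }
}
```

The last query shows the limitation discussed in this thread: an unseparated string is treated as a single term and finds nothing unless the index happens to contain it.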

> Here's another piece of news:
> Kuromoji has been ported to JavaScript: https://github.com/takuyaa/kuromoji.js
> I haven't tried it, but I expect some difficulties. I heard it requires a 17 MB dictionary.

That is too large for a client-side JavaScript operation. Tokenizing
the search string entered by the user could be very slow with a 17 MB
JavaScript dictionary. The search will instead rely on properly
separated search terms entered by the user, as you suggested above.

> Thanks,
> T. Hatanaka

Best regards,
Sorin

<oXygen/> XML Editor

http://www.oxygenxml.com