[oXygen-user] Feature request: Improvement of Japanese search for WebHelp

Support Oxygen XML Editor (Sorin Ristache)
Thu Apr 16 05:07:09 CDT 2015


Hi,

I did some research on user dictionaries. My understanding is that
setting a user dictionary is a significant enhancement for morphological
Lucene analyzers such as those for the CJK languages, for example the
JapaneseAnalyzer, and that it also helps when indexing domain-specific
content in any language.

So we should add a new parameter to the WebHelp page generation process
for setting a user dictionary, which makes sense for any language of the
page content. For a morphological analyzer like the JapaneseAnalyzer the
user dictionary is simple and direct to use, because it is just a
parameter of the class constructor. For non-CJK languages it is a little
more complicated, because it has to be integrated into the sequence of
Lucene filters that follows the initial Lucene tokenizer in the typical
Lucene processing pipeline.
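
As a rough illustration of why the user dictionary matters, here is a
minimal Python sketch built on the "OVERPAYPALEBAYBERRY" example from the
quoted message below. It is not how Kuromoji works internally (a real
morphological analyzer scores candidate paths with costs rather than
enumerating them); it only lists every way to split the text into
dictionary words. The word lists BUILT_IN and USER are assumptions made
up for this illustration, including the filler entries "o" and "rry".

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this word and segment the remainder recursively.
            for tail in segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


# Hypothetical stand-in for the analyzer's built-in dictionary.
BUILT_IN = {
    "over", "overpay", "pay", "pal", "paypal", "pale",
    "e", "bay", "ebay", "berry", "bayberry",
}

# Hypothetical user dictionary: the novel word from the example, plus the
# two fragments needed so that the rest of the text can still be covered.
USER = {"verpaypalebaybe", "o", "rry"}

TEXT = "overpaypalebayberry"

built_in_splits = segmentations(TEXT, BUILT_IN)
with_user_splits = segmentations(TEXT, BUILT_IN | USER)

# True if the novel word appears in any segmentation.
novel_before = any("verpaypalebaybe" in s for s in built_in_splits)
novel_after = any("verpaypalebaybe" in s for s in with_user_splits)
```

With only the built-in words, the text already has several valid
segmentations and none of them contains "verpaypalebaybe", so a search
for that term could not match an indexed token. Once the user dictionary
is added, a segmentation containing it becomes available.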


Thank you for your suggestions,
Sorin

<oXygen/> XML Editor

http://www.oxygenxml.com


On 4/15/2015 6:13 PM, T. Hatanaka wrote:
> Hi,
>
>> What would the custom Japanese user dictionary add to the build-time
>> indexing process:
>>
>> - a list of domain-specific words
>
> It will be the most important.
>
>> - or some generic tweaks that would allow matching a client-side search
>
> Presumably that will be the role of the JavaScript code, just as its wordsStartsWith() (I guess) does currently.
>
>> the generic tweaks added by a custom user
>> dictionary would not be needed anymore. Could you give a short example
>> please?
>
> Suppose there is a Japanese sentence whose structure resembles "OVERPAYPALEBAYBERRY."
>
> Kuromoji or any analyzer may segment it as "over,paypal,ebay,berry", "overpay,pale,bayberry", "over,pay,pal,e,bay,berry" or so, depending on its built-in dictionary and method. There is no single authoritative answer.
>
> With such an imperfect index, the user may search for 'over', 'overpay', 'paypal', 'pale', 'ale', 'ebay', 'bay', 'bayberry', 'verpaypalebaybe' or even 'e'... (The last two may look ridiculous, but such a sequence of characters can be a self-contained, atomic, well-known word in the Japanese language.)
>
> Kuromoji is clever enough to return multiple terms such as "(paypal|pay,pal),(ebay|e,bay)", but that would never be perfect. Hence the user dictionary can play a critical role in letting the analyzer know about the novel word "verpaypalebaybe" or in boosting the priority of "ale".
>
> Thanks,
> T. Hatanaka
> ________________________________________
> From:  <> on behalf of Support Oxygen XML Editor (Sorin Ristache) <>
> Sent: Wednesday, April 15, 2015 21:14
> To: T. Hatanaka; 
> Subject: Re: [oXygen-user] Feature request: Improvement of Japanese search for WebHelp
>
> Hello,
>
> Thank you for telling us, we will try to integrate the Kuromoji
> analyzer into the Apache Lucene system that indexes the WebHelp pages.
>
> What would the custom Japanese user dictionary add to the build-time
> indexing process:
>
> - a list of domain-specific words that are relevant for the domain of
> the current DITA map and that are missing in the generic dictionary that
> comes with the Kuromoji analyzer,
>
> - or some generic tweaks that would allow matching a client-side search
> term with a partial match in the index built based on the WebHelp pages?
>
> I thought Kuromoji was a morphological analyzer that builds the index so
> that a client-side search term will be matched with an indexed term
> picked up from a WebHelp page even though the indexed term is only a
> partial match, which means the generic tweaks added by a custom user
> dictionary would not be needed anymore. Could you give a short example
> please?
>
>
> Thank you,
> Sorin
>
> <oXygen/> XML Editor
>
> http://www.oxygenxml.com

