Searches for quoted phrases are matching topics with no such phrase

chrispitude
Posts: 907
Joined: Thu May 02, 2019 2:32 pm

Post by chrispitude »

We have gotten feedback that sometimes WebHelp searches for quoted phrases will return topics that contain words in the phrase, but not that phrase in its entirety.

For example, in the attached testcase, a search for "user function" returns the following results:

[image.png: screenshot of the search results]

but only the second topic actually contains the phrase "user function".

Here's the testcase:

webhelp_quoted_search_phrase.zip

It contains a text file of word pairs (pairs.txt) and a Perl script that can randomly regenerate the testcase with the pairs distributed across any number of topics. For example, the attached testcase was generated with:

Code: Select all

make_pairs_testcase.pl 10 pairs.txt
The behavior is erratic. Topic count does not seem to affect it much: some randomly ordered testcases show the issue and others do not. I haven't been able to identify the underlying cause.

Please let us know the issue ID so we can track it on our side. Thanks!
beniamin_savu
Posts: 31
Joined: Fri Jan 22, 2021 11:05 am

Post by beniamin_savu »

Hi,

We investigated the test case that you provided. When searching for "user function" the search engine does indeed return the results you mentioned.

This is happening because:
  • topic 7: variable function contains:
    user by
    as function
  • topic 5: lesser function contains:
    user will
    not function
To improve performance, the search indexer excludes stop words, because they are not relevant to the search. Once the stop words are removed, the text "user by as function" and "user will not function" both become "user function". This is why topic 7 and topic 5 appear in the search results. If you change these words to something that is not a stop word, only topic 6 will appear in the search results.
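
As a rough illustration of the effect described above (not the actual WebHelp indexer), the sketch below drops stop words before comparing terms. The stop-word list and the function names indexTerms and containsPhrase are made up for this example.

```javascript
// Illustrative only: a tiny stop-word list, not the list WebHelp uses.
const STOP_WORDS = new Set(["by", "as", "will", "not"]);

// Lowercase the text, split on whitespace, and drop stop words.
function indexTerms(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word !== "" && !STOP_WORDS.has(word));
}

// True if the phrase terms appear consecutively among the remaining terms.
function containsPhrase(text, phrase) {
  const terms = indexTerms(text);
  const wanted = indexTerms(phrase);
  if (wanted.length === 0) return false;
  for (let i = 0; i + wanted.length <= terms.length; i++) {
    if (wanted.every((w, k) => terms[i + k] === w)) return true;
  }
  return false;
}

console.log(indexTerms("user by as function"));
console.log(containsPhrase("user will not function", "user function"));
```

Because the stop words vanish before the comparison, both sample texts end up looking identical to the phrase being searched for.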

Best regards,
Beniamin Savu
Oxygen WebHelp Team
http://www.oxygenxml.com

Post by chrispitude »

Hi Beniamin,

Thank you for identifying the root cause of the search matches! I now understand that the search algorithm is working as designed.

However, perhaps it could be improved. A user who searches for a quoted phrase expects the results to contain only instances of that exact phrase. To meet this expectation,
  1. The quoted phrase should not match when split across stop words.
  2. The quoted phrase should not match when split across block elements.
As a side note, #1 seems to vary across search engines - some match quoted phrases across stop words and some do not.

I don't know how the WebHelp search is implemented internally. If the content is flattened into a linear word list, then perhaps some kind of special "break" character or token could be placed at stop words (maybe?) and between block elements (definitely) to retain these semantics.
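
To make the "break" idea concrete, here is a minimal sketch under my own assumptions about the data: each topic is given as an array of block-element strings, stop words are skipped but still consume a position, and one extra position is skipped between blocks. None of the names (buildPositions, hasExactPhrase) come from WebHelp.

```javascript
// Illustrative stop-word list; the real list is longer.
const STOP_WORDS = new Set(["with", "the", "by", "as"]);

// blocks: one string per block element (hypothetical input shape).
// Returns a Map from word to the list of positions where it occurs.
function buildPositions(blocks) {
  const positions = new Map();
  let pos = 0;
  for (const block of blocks) {
    for (const word of block.toLowerCase().split(/\s+/).filter(Boolean)) {
      if (!STOP_WORDS.has(word)) {
        if (!positions.has(word)) positions.set(word, []);
        positions.get(word).push(pos);
      }
      pos += 1; // a stop word is not stored but still consumes a position
    }
    pos += 1; // extra gap acts as the "break" between block elements
  }
  return positions;
}

// True only if the words occupy strictly consecutive positions.
function hasExactPhrase(positions, phraseWords) {
  const first = positions.get(phraseWords[0]) || [];
  return first.some((start) =>
    phraseWords.every((w, k) => (positions.get(w) || []).includes(start + k))
  );
}

const split = buildPositions(["user by", "as function"]);
const exact = buildPositions(["the user function"]);
console.log(hasExactPhrase(split, ["user", "function"]));
console.log(hasExactPhrase(exact, ["user", "function"]));
```

With gaps preserved, "user by / as function" no longer looks like "user function", while a genuine occurrence still matches.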

Post by chrispitude »

Hi again Beniamin,

To apply the knowledge you gave me, I decided to make a more realistic testcase based on our production content. Given a quoted search for "layout new", I took the expected matching topic and five unexpected matching topics, then I ran an XML find-and-replace within element content with the following regex pattern:

Code: Select all

\b(?!\w*(layout|new))(?!\b(but|be|with|such|then|for|no|will|not|are|and|their|if|this|on|into|a|or|there|in|that|they|was|is|it|an|the|as|at|these|by|to|of)\b)\w+\b
which replaces all words with "WORD", except for stop words and words containing "layout" or "new".
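
For anyone who wants to try the same substitution outside of Oxygen, the pattern should also work in JavaScript, since it only uses lookaheads and word boundaries. This is just a sketch: the anonymize name is made up, and matching here is case-sensitive, so add the i flag if you need capitalized stop words such as "No" to be preserved.

```javascript
// Same stop-word alternation as in the pattern above.
const STOP =
  "but|be|with|such|then|for|no|will|not|are|and|their|if|this|on|into|a|or|" +
  "there|in|that|they|was|is|it|an|the|as|at|these|by|to|of";

// A word is replaced unless it contains "layout" or "new" or is a stop word.
const pattern = new RegExp(
  "\\b(?!\\w*(layout|new))(?!\\b(" + STOP + ")\\b)\\w+\\b",
  "g"
);

// Hypothetical helper: replace every other word with the literal "WORD".
function anonymize(text) {
  return text.replace(pattern, "WORD");
}

console.log(anonymize("the quick layout with a brand new_cell design"));
```

Note that the underscore counts as a word character, so new_cell is treated as a single word containing "new" and is left alone.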

A search for the quoted phrase "layout new" returns the following results:

[image.png: screenshot of the search results]

The "layout new" topic is the only topic containing that phrase, and it is ranked lower than the other topics that do not contain that exact phrase.

To identify possible matches from #1 and #2, you can copy and paste the topic text into this online regex testing page, then you can search with the following pattern:

Code: Select all

layout\w*\s+((but|be|with|such|then|for|no|will|not|are|and|their|if|this|on|into|a|or|there|in|that|they|was|is|it|an|the|as|at|these|by|to|of)\s+)*new\w*
which looks for words beginning with "layout" or "new" separated by zero or more stopwords in the flattened topic text. But even with this, I found instances in which the matches cannot be attributed to #1 or #2.
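
If an online tester is not handy, the same check can be scripted. The sketch below applies the pattern to a string of flattened topic text; findCandidates is a made-up name. It also shows one limit of the pattern: it only allows whitespace between words, so text separated by punctuation is not reported.

```javascript
// Same stop-word alternation as in the pattern above.
const STOP =
  "but|be|with|such|then|for|no|will|not|are|and|their|if|this|on|into|a|or|" +
  "there|in|that|they|was|is|it|an|the|as|at|these|by|to|of";

// "layout..." followed by zero or more stop words, then "new...".
const splitPhrase = new RegExp(
  "layout\\w*\\s+((" + STOP + ")\\s+)*new\\w*",
  "g"
);

// Hypothetical helper: return every matching span in the flattened text.
function findCandidates(text) {
  return text.match(splitPhrase) || [];
}

console.log(findCandidates("layout with the new"));
console.log(findCandidates("layouts new_cell and layout of WORD new"));
console.log(findCandidates("layout. No new"));
```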

Here's the testcase:

webhelp_search_quoted_phrase2.zip

To run,
  1. Open the .xpr project file.
  2. Open the .ditamap file.
  3. Run the "Synopsys WebHelp" transformation defined in the .xpr project file.

Post by beniamin_savu »

Hi,

Thank you for your feedback and for providing the test case. It really helps.

Regarding your test case, please note that the name of the DITA map contains "layout new", and this text appears in all topics. Unfortunately, that does not fix the search results.

While investigating your test case, I found an issue: the phrase search was also matching phrases in which a word merely starts with one of the words in the search query. For example, "layout new" was giving a match for "layout newTopic".

Other matches that I detected while investigating your test case:
  • Topic 1

    Code: Select all

    layout. No new // The search indexer excludes punctuation marks and stopwords.
  • Topic 3

    Code: Select all

    layout with new_cell // The search indexer breaks composed words into separate words and assigns them the same position in the index
    
    layout([new_cell]) // The search indexer uses ()[] as separators when breaking a word.
    
    new_layout, 'new_created_layout.WORD' // The search indexer excludes punctuation marks and breaks composed words into separate words.
  • Topic 4

    Code: Select all

    layout with the new // The search indexer excludes stopwords.
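
The matches listed above can be reproduced with a simplified model of the behavior described in the comments. This is only a sketch under assumptions, not the real indexer: punctuation and ()[] act as separators, stop words are dropped without consuming a position, and the parts of an underscore-composed word share the position of the whole word. The names tokenize and phraseMatches and the short stop-word list are hypothetical.

```javascript
// Illustrative stop-word list; the real list is longer.
const STOP_WORDS = new Set(["with", "the", "no"]);

// Returns tokens as { term, pos }; parts of a composed word share one position.
function tokenize(text) {
  const tokens = [];
  let pos = 0;
  // Anything other than letters, digits, or underscore is treated as a separator.
  for (const raw of text.toLowerCase().split(/[^a-z0-9_]+/).filter(Boolean)) {
    if (STOP_WORDS.has(raw)) continue; // dropped, no position consumed
    tokens.push({ term: raw, pos });
    const parts = raw.split("_").filter(Boolean);
    if (parts.length > 1) {
      for (const part of parts) tokens.push({ term: part, pos });
    }
    pos += 1;
  }
  return tokens;
}

// True if the phrase words occur at consecutive positions.
function phraseMatches(text, phrase) {
  const tokens = tokenize(text);
  const words = phrase.toLowerCase().split(/\s+/).filter(Boolean);
  return tokens.some(
    (t) =>
      t.term === words[0] &&
      words.every((w, k) =>
        tokens.some((u) => u.pos === t.pos + k && u.term === w)
      )
  );
}

console.log(phraseMatches("layout. No new", "layout new"));
console.log(phraseMatches("layout([new_cell])", "layout new"));
console.log(phraseMatches("layout adds new", "layout new"));
```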

I have added an issue (WH-3155) to our internal issue tracker to improve the phrase search. I have also added your feedback to the issue. It will be analyzed by our development team.

Best regards,
Beniamin Savu
Oxygen WebHelp Team
http://www.oxygenxml.com

Post by chrispitude »

Hi Beniamin,

Thanks on both counts! Those sneaky matches distributed across punctuation, line breaks, and stop words are indeed tricky to find. Thanks for filing the issue for the partial word match, and hopefully some improvements can be made with the unexpected distributed matches too.

- Chris
coryc
Posts: 5
Joined: Mon Apr 13, 2020 5:16 pm

Post by coryc »

I found that by making the changes below to the consecutive-word test in nwSearchFnt.js, I was able to achieve the expected behavior in both quoted and unquoted phrase searches.

If you drop the attached version of nwSearchFnt.js into [webhelp_output_dir]/oxygen-webhelp/app/search, it seems to correct the problem. The key is to eliminate from one of the candidate-match arrays each element that is not a direct right-sibling of the current word in the phrase. If this is indeed a viable solution, as a bonus it has the potential to improve phrase-search performance ever so slightly.

-coryc
nwSearchFnt.zip
(Begins at line 539)

Code: Select all

                        var consecutiveIndices = true;
                        // Test if next words indices are consecutive
                        for (var ii = 1; ii < resPerFileArray[i].wordsList.length; ii++) {

                            var nextIndices = resPerFileArray[i].wordsList[ii].indices;
                            var nextIdxFound = false;
                            //for (var nIdx in nextIndices) {  //mod: test nextIndices in reverse order
                            for(var nIdx=nextIndices.length-1;nIdx>-1;nIdx--){
                                var cRes = parseInt(nextIndices[nIdx], 32);

                                if (cRes != -1 && cidx == cRes - 1) {
                                    cidx = cRes;
                                    nextIdxFound = true;
                                    break;
                                } else { // mod: if this current ix not equal to current result, eliminate topic from array
                                    var rem=nextIndices.splice(nIdx, 1);
                                    //console.log("removed element at ",nIdx);
                                }
                            }
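
For readers without the full file, here is a standalone sketch of the consecutive-position idea on its own. It is not the code from nwSearchFnt.js: the wordsList shape (objects with an indices array of base-32 strings) is inferred from the excerpt above, and hasConsecutiveRun is a made-up name. Unlike the excerpt, it does not modify the indices arrays.

```javascript
// wordsList: one entry per phrase word, in phrase order (assumed shape):
//   [{ word: "layout", indices: ["3", "a"] }, { word: "new", indices: ["5", "b"] }]
function hasConsecutiveRun(wordsList) {
  if (wordsList.length === 0) return false;
  // Decode the base-32 position strings once, as the excerpt does with parseInt(..., 32).
  const decoded = wordsList.map(
    (w) => new Set(w.indices.map((s) => parseInt(s, 32)))
  );
  // Try every position of the first word as the start of the phrase.
  for (const start of decoded[0]) {
    let ok = true;
    for (let k = 1; k < decoded.length; k++) {
      if (!decoded[k].has(start + k)) {
        ok = false;
        break;
      }
    }
    if (ok) return true;
  }
  return false;
}

console.log(
  hasConsecutiveRun([
    { word: "layout", indices: ["3", "a"] },
    { word: "new", indices: ["5", "b"] },
  ])
);
```

Trying every start position avoids depending on the order in which the indices are visited, at the cost of a few extra lookups.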

Post by beniamin_savu »

Hi,

Thank you for sharing your changes with us. I have added your feedback to the issue in our internal issue tracker so that it can be analyzed further by our development team.

Best regards,
Beniamin Savu
Oxygen WebHelp Team
http://www.oxygenxml.com

Post by coryc »

The modifications I posted previously turn out not to be a complete solution: the modified nwSearchFnt.js can still return inaccurate results if there are stop words between consecutive terms in the HTML. I've arrived at a more complete solution for quoted-phrase searching that does return accurate results, although it requires prior knowledge of all stop words, which are hard-coded in custom_build_file.xml. It consists of:

1. A modified nwSearchFnt.js file that identifies consecutive words in a quoted phrase by comparing their offset values in index-?.js files. WordA is adjacent to WordB if the offset of WordB is one higher than the offset of WordA.
2. A custom WebHelp Responsive publishing template, "qp_fix/WebHelp Responsive quoted-phrase fix.opt," which defines parameter webhelp.search.stop.words.exclude as the default list of all stopwords generated by the Lucene indexer.
3. A DITA-OT plugin to hook the whr-search-index-post extension point and:
- Strip stop words from the index-?.js files
- Write a stopwords.js file with the default content
- Copy the modified nwSearchFnt.js file to the correct location in the WebHelp output directory.

This approach is necessary because the offset values in the indexes generated by Lucene do not reflect intervening stop words. A more elegant solution would be to directly customize Lucene and correct that. If that were done, only the modified nwSearchFnt.js would be necessary.
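
The idea can be sketched as follows. This does not reflect the real format of the index-?.js files, which I am not reproducing here; the index is modeled as a plain object mapping each term to its offsets, and stripStopWords and isAdjacent are hypothetical names. The point is only that when stop words are indexed first and removed afterwards, the surviving offsets keep their gaps.

```javascript
// Remove stop-word entries but leave every remaining offset untouched.
function stripStopWords(index, stopWords) {
  const stripped = {};
  for (const [term, offsets] of Object.entries(index)) {
    if (!stopWords.has(term)) stripped[term] = offsets;
  }
  return stripped;
}

// WordA is adjacent to WordB if some offset of WordB is one higher
// than an offset of WordA.
function isAdjacent(index, wordA, wordB) {
  const a = index[wordA] || [];
  const b = new Set(index[wordB] || []);
  return a.some((offset) => b.has(offset + 1));
}

// Hypothetical index for the text "layout with the new".
const withStopWords = { layout: [0], with: [1], the: [2], new: [3] };
const stripped = stripStopWords(withStopWords, new Set(["with", "the"]));
console.log(isAdjacent(stripped, "layout", "new"));
```

If the offsets had instead been assigned after dropping the stop words (layout at 0, new at 1), the same adjacency test would report a match.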

-coryc
================================================================
I had to split this demo into several attachments. To set it up, download all four attachments, and extract dita_sources.7z, project-map.7z, and qp_fix.7z to the same directory. (com.synopsys.webhelp.search-index.zip is the plugin.)

To run the demo in Oxygen:
1. Integrate com.synopsys.webhelp.search-index.zip into DITA-OT
2. Open OPENME1.xpr
3. Open OPENME2.ditamap
4. Edit the Synopsys WebHelp transformation scenario and add qp_fix/WebHelp Responsive quoted-phrase fix.opt through Choose Custom Publishing Template
5. Apply the Synopsys WebHelp transformation to OPENME2.ditamap

To run the demo in Oxygen Publishing Engine:
1. Integrate com.synopsys.webhelp.search-index.zip into DITA-OT
2. Run the following command: dita --project project.xml

A search for the quoted phrase "layout new" returns one result.
A search for the unquoted phrase layout new returns six results.
com.synopsys.webhelp.search-index.zip
dita_sources.7z
project-map.7z
qp_fix.7z