Searches for quoted phrases are matching topics with no such phrase
-
- Posts: 922
- Joined: Thu May 02, 2019 2:32 pm
Searches for quoted phrases are matching topics with no such phrase
Post by chrispitude »
We have gotten feedback that sometimes WebHelp searches for quoted phrases will return topics that contain words in the phrase, but not that phrase in its entirety.
For example, in the attached testcase, a search for "user function" returns the following results:
image.png
but only the second topic actually contains the phrase "user function".
Here's the testcase:
webhelp_quoted_search_phrase.zip
It contains a text file of word pairs (pairs.txt) and a Perl script that can randomly regenerate the testcase with the pairs allocated across any number of topics. For example, the attached testcase was generated with:
Code: Select all
make_pairs_testcase.pl 10 pairs.txt
The behavior is erratic. It does not seem to depend much on topic count: some randomly ordered testcases show the issue and some do not. I haven't been able to identify the underlying cause.
Please let us know the issue ID so we can track it on our side. Thanks!
-
- Posts: 38
- Joined: Fri Jan 22, 2021 11:05 am
Re: Searches for quoted phrases are matching topics with no such phrase
Post by beniamin_savu »
Hi,
We investigated the test case that you provided. When searching for "user function" the search engine does indeed return the results you mentioned.
This is happening because:
- Topic 7 ("variable function") contains:
user by
as function
- Topic 5 ("lesser function") contains:
user will
not function
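To make the mechanism concrete: in both topics, "user" and "function" are separated only by a line break and by stop words ("by", "as", "will", "not"). The sketch below is a hypothetical model, not the actual WebHelp indexer. It assumes the index drops stop words and punctuation before numbering the remaining words, and that a phrase matches when its words sit at consecutive positions. The names `STOP_WORDS`, `buildNaiveIndex`, and `naivePhraseMatch` are invented for illustration.

```javascript
// Hypothetical, abbreviated stop-word list (the real list is longer).
const STOP_WORDS = new Set(["by", "as", "will", "not", "the", "a", "of"]);

// Flatten text into a map of word -> positions.
// Stop words and punctuation are dropped BEFORE positions are assigned,
// so the words on either side of them become neighbors.
function buildNaiveIndex(text) {
  const index = new Map();
  let position = 0;
  for (const token of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token === "" || STOP_WORDS.has(token)) continue;
    if (!index.has(token)) index.set(token, []);
    index.get(token).push(position++);
  }
  return index;
}

// A phrase matches if every word sits exactly one position after the previous one.
function naivePhraseMatch(index, phrase) {
  const words = phrase.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return false;
  const starts = index.get(words[0]) || [];
  return starts.some(start =>
    words.every((w, k) => (index.get(w) || []).includes(start + k)));
}

const topic7 = "user by\nas function";
console.log(naivePhraseMatch(buildNaiveIndex(topic7), "user function")); // true
```

Under these assumptions, the text from topic 7 is reported as a phrase match even though "user function" never appears in it.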
Best regards,
Beniamin Savu
Oxygen WebHelp Team
http://www.oxygenxml.com
-
- Posts: 922
- Joined: Thu May 02, 2019 2:32 pm
Re: Searches for quoted phrases are matching topics with no such phrase
Post by chrispitude »
Hi Beniamin,
Thank you for identifying the root cause of the search matches! I now understand that the search algorithm is working as designed.
However, perhaps it could be improved. Recall that the user's expectation when searching for a quoted phrase is to return instances of that exact phrase. To meet this user expectation,
- The quoted phrase should not match when split across stop words.
- The quoted phrase should not match when split across block elements.
I don't know how the WebHelp search is implemented internally. If the content is flattened into a linear word list, then perhaps some kind of special "break" character or token could be placed at stop words (maybe?) and between block elements (definitely) to retain these semantics.
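As a rough illustration of the "break" idea, here is a minimal sketch, assuming the content really is flattened into a positional word list. It is not how WebHelp is implemented; `BREAK_STOP_WORDS`, `BLOCK_GAP`, `buildBreakAwareIndex`, and `breakAwarePhraseMatch` are hypothetical names. Stop words are left out of the index but still consume a position, and each block boundary skips an extra position, so words on either side of a break are never numbered consecutively. For simplicity, the query is assumed to contain no stop words.

```javascript
// Hypothetical stop-word list and block gap, for illustration only.
const BREAK_STOP_WORDS = new Set(["by", "as", "will", "not", "the", "with"]);
const BLOCK_GAP = 1; // extra positions skipped at each block boundary

// blocks: one string per block element (paragraph, list item, ...).
function buildBreakAwareIndex(blocks) {
  const index = new Map();
  let position = 0;
  for (const block of blocks) {
    for (const token of block.toLowerCase().split(/[^a-z0-9]+/)) {
      if (token === "") continue;
      if (!BREAK_STOP_WORDS.has(token)) {
        if (!index.has(token)) index.set(token, []);
        index.get(token).push(position);
      }
      position++; // stop words are not indexed but still consume a position
    }
    position += BLOCK_GAP; // the block boundary acts as a "break" token
  }
  return index;
}

// Same consecutive-position test as a plain positional index would use.
function breakAwarePhraseMatch(index, phrase) {
  const words = phrase.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return false;
  const starts = index.get(words[0]) || [];
  return starts.some(start =>
    words.every((w, k) => (index.get(w) || []).includes(start + k)));
}

// Split across stop words and a block boundary: no match.
console.log(breakAwarePhraseMatch(
  buildBreakAwareIndex(["user by", "as function"]), "user function")); // false
// Genuine phrase inside one block: match.
console.log(breakAwarePhraseMatch(
  buildBreakAwareIndex(["the user function"]), "user function")); // true
```

The matching logic itself does not change; only the position numbering does, which is why a fix at indexing time would be the least invasive option if the indexer allows it.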
-
- Posts: 922
- Joined: Thu May 02, 2019 2:32 pm
Re: Searches for quoted phrases are matching topics with no such phrase
Post by chrispitude »
Hi again Beniamin,
To apply the knowledge you gave me, I decided to make a more realistic testcase based on our production content. Given a quoted search for "layout new", I took the expected matching topic and five unexpected matching topics, then I ran an XML find-and-replace within element content with the following regex pattern:
Code: Select all
\b(?!\w*(layout|new))(?!\b(but|be|with|such|then|for|no|will|not|are|and|their|if|this|on|into|a|or|there|in|that|they|was|is|it|an|the|as|at|these|by|to|of)\b)\w+\b
which replaces all words with "WORD", except for stop words and words containing "layout" or "new".
A search for the quoted phrase "layout new" returns the following results:
image.png
The "layout new" topic is the only topic containing that phrase, and it is ranked lower than the other topics that do not contain that exact phrase.
To identify possible matches from #1 and #2, you can copy and paste the topic text into this online regex testing page, then you can search with the following pattern:
Code: Select all
layout\w*\s+((but|be|with|such|then|for|no|will|not|are|and|their|if|this|on|into|a|or|there|in|that|they|was|is|it|an|the|as|at|these|by|to|of)\s+)*new\w*
which looks for words beginning with "layout" or "new" separated by zero or more stopwords in the flattened topic text. But even with this, I found instances in which the matches cannot be attributed to #1 or #2.
Here's the testcase:
webhelp_search_quoted_phrase2.zip
To run,
- Open the .xpr project file.
- Open the .ditamap file.
- Run the "Synopsys WebHelp" transformation defined in the .xpr project file.
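For anyone who prefers a script to a regex-testing page, the diagnostic pattern from this post can be run in JavaScript. This is only a convenience sketch: the stop-word alternation is copied from the post, while `findCandidateMatches` and the sample strings are made up for illustration.

```javascript
// Stop-word alternation copied from the diagnostic pattern in the post.
const stopWordAlternation =
  "but|be|with|such|then|for|no|will|not|are|and|their|if|this|on|into|" +
  "a|or|there|in|that|they|was|is|it|an|the|as|at|these|by|to|of";

// layout\w*\s+((stop words)\s+)*new\w*
const candidatePattern = new RegExp(
  `layout\\w*\\s+((${stopWordAlternation})\\s+)*new\\w*`, "g");

// Return every stretch of flattened topic text that the pattern flags.
function findCandidateMatches(flattenedText) {
  return flattenedText.match(candidatePattern) || [];
}

console.log(findCandidateMatches("WORD layout with the new WORD"));
// [ 'layout with the new' ]
console.log(findCandidateMatches("WORD layout. No new WORD"));
// [] (punctuation between the words is not covered by this pattern)
```

The second call shows one reason the pattern cannot explain every unexpected hit: it only allows whitespace and stop words between the two terms.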
-
- Posts: 38
- Joined: Fri Jan 22, 2021 11:05 am
Re: Searches for quoted phrases are matching topics with no such phrase
Post by beniamin_savu »
Hi,
Thank you for your feedback and for providing the test case. It really helps.
Regarding your test case, please note that the DITA map name contains "layout new", and this text appears in all topics. Unfortunately, this does not fix the search result.
While investigating your test case, I found an issue: the phrase search was also giving a match for a phrase in which one of the words merely starts with a word from the search query. For example,
"layout new" was giving a match for "layout newTopic".
Other matches that I detected while investigating your test case:
- Topic 1
Code: Select all
layout. No new // The search indexer excludes punctuation marks and stopwords.
- Topic 3
Code: Select all
layout with new_cell // The search indexer breaks composed words into separate words and assigns them the same position in the index.
layout([new_cell]) // The search indexer uses ()[] as separators when breaking a word.
new_layout, 'new_created_layout.WORD' // The search indexer excludes punctuation marks and breaks composed words into separate words.
- Topic 4
Code: Select all
layout with the new // The search indexer excludes stopwords.
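The indexer behaviors listed for Topics 1, 3, and 4 can be modeled in a few lines. This is a guess at the behavior based only on the comments above, not the real indexer: it assumes that whitespace, brackets, quotes, and commas separate words, that the parts of a composed word (joined by "_" or ".") all receive the same position, and that stop words receive no position at all. The names `COMPOSED_STOP_WORDS`, `indexWithComposedWords`, and `matchesPhrase` are hypothetical.

```javascript
// Hypothetical, abbreviated stop-word list.
const COMPOSED_STOP_WORDS = new Set(["with", "the", "no"]);

function indexWithComposedWords(text) {
  const index = new Map();
  let position = 0;
  // Chunks are separated by whitespace, ()[] brackets, quotes, and commas.
  for (const chunk of text.toLowerCase().split(/[\s()\[\]',]+/)) {
    // Parts of a composed word share one position; stop words are dropped.
    const parts = chunk.split(/[_.]+/)
      .filter(p => p !== "" && !COMPOSED_STOP_WORDS.has(p));
    if (parts.length === 0) continue;
    for (const part of parts) {
      if (!index.has(part)) index.set(part, []);
      index.get(part).push(position);
    }
    position++;
  }
  return index;
}

function matchesPhrase(index, phrase) {
  const words = phrase.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return false;
  const starts = index.get(words[0]) || [];
  return starts.some(start =>
    words.every((w, k) => (index.get(w) || []).includes(start + k)));
}

const samples = [
  "layout. No new",
  "layout with new_cell",
  "layout([new_cell])",
  "new_layout, 'new_created_layout.WORD'",
  "layout with the new",
];
for (const sample of samples) {
  console.log(sample, "->", matchesPhrase(indexWithComposedWords(sample), "layout new"));
}
```

Under these assumptions every sample is reported as a match for "layout new". In the fourth sample, "layout" from new_layout lands at one position and "new" from new_created_layout at the next.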
I have added an issue (WH-3155) in our internal issue tracker to improve the phrase search. I have also added your feedback to the issue. It will be analyzed by our development team.
Best regards,
Beniamin Savu
Oxygen WebHelp Team
http://www.oxygenxml.com
-
- Posts: 922
- Joined: Thu May 02, 2019 2:32 pm
Re: Searches for quoted phrases are matching topics with no such phrase
Post by chrispitude »
Hi Beniamin,
Thanks on both counts! Those sneaky matches distributed across punctuation, line breaks, and stop words are indeed tricky to find. Thanks for filing the issue for the partial word match, and hopefully some improvements can be made with the unexpected distributed matches too.
- Chris
-
- Posts: 5
- Joined: Mon Apr 13, 2020 5:16 pm
Re: Searches for quoted phrases are matching topics with no such phrase
I found that by making the changes below to the consecutive-word test in nwSearchFnt.js, I was able to achieve the expected behavior in both quoted and unquoted phrase searches.
If you drop the attached version of nwSearchFnt.js into [webhelp_output_dir]/oxygen-webhelp/app/search, it seems to correct the problem. The key is to eliminate from one of the candidate-match arrays each element that is not a direct right-sibling of the current word in the phrase. If this is indeed a viable solution, as a bonus it has the potential to improve phrase-search performance ever so slightly.
-coryc
nwSearchFnt.zip
(Begins at line 539)
Code: Select all
var consecutiveIndices = true;
// Test if next words indices are consecutive
for (var ii = 1; ii < resPerFileArray[i].wordsList.length; ii++) {
    var nextIndices = resPerFileArray[i].wordsList[ii].indices;
    var nextIdxFound = false;
    //for (var nIdx in nextIndices) { //mod: test nextIndices in reverse order
    for (var nIdx = nextIndices.length - 1; nIdx > -1; nIdx--) {
        var cRes = parseInt(nextIndices[nIdx], 32);
        if (cRes != -1 && cidx == cRes - 1) {
            cidx = cRes;
            nextIdxFound = true;
            break;
        } else { // mod: if this current ix not equal to current result, eliminate topic from array
            var rem = nextIndices.splice(nIdx, 1);
            //console.log("removed element at ",nIdx);
        }
    }
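Because the excerpt depends on surrounding variables (resPerFileArray, cidx), the idea behind the consecutive-word test is easier to see in isolation. The following is a standalone sketch, not the code from nwSearchFnt.js. It assumes, based on the excerpt, that each query word carries a list of positions encoded as base-32 strings; `hasConsecutiveRun` is a hypothetical helper name.

```javascript
// wordsList: one entry per query word, in phrase order. Each entry holds the
// word's positions within a single topic as base-32 strings (assumed shape).
function hasConsecutiveRun(wordsList) {
  if (wordsList.length === 0) return false;
  // Decode each word's positions once; ignore invalid or negative values.
  const positionSets = wordsList.map(word => new Set(
    word.indices
      .map(s => parseInt(s, 32))
      .filter(n => Number.isInteger(n) && n >= 0)));
  // Try every occurrence of the first word as the start of the phrase.
  for (const start of positionSets[0]) {
    if (positionSets.every((set, k) => set.has(start + k))) return true;
  }
  return false;
}

// "a" = 10 and "1f" = 47 for the first word; "3" = 3 and "1g" = 48 for the second.
console.log(hasConsecutiveRun([
  { indices: ["a", "1f"] },
  { indices: ["3", "1g"] },
])); // true, because 47 is directly followed by 48
```

Unlike the excerpt, this version does not modify the index arrays, so it can be called repeatedly on the same data. As with any position-based test, it is only as accurate as the positions stored in the index.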
-
- Posts: 38
- Joined: Fri Jan 22, 2021 11:05 am
Re: Searches for quoted phrases are matching topics with no such phrase
Post by beniamin_savu »
Hi,
Thank you for sharing with us your changes. I have added your feedback to our issue in our internal issue tracker to be further analyzed by our development team.
Best regards,
Beniamin Savu
Oxygen WebHelp Team
http://www.oxygenxml.com
-
- Posts: 5
- Joined: Mon Apr 13, 2020 5:16 pm
Re: Searches for quoted phrases are matching topics with no such phrase
The modifications I posted previously turn out not to be a complete solution: the modified nwSearchFnt.js can still return inaccurate results if there are stop words between consecutive terms in the HTML. I've arrived at a more complete solution for quoted-phrase searching that does return accurate results, although it requires prior knowledge of all stop words, which are hard-coded in custom_build_file.xml. It consists of:
1. A modified nwSearchFnt.js file that identifies consecutive words in a quoted phrase by comparing their offset values in index-?.js files. WordA is adjacent to WordB if the offset of WordB is one higher than the offset of WordA.
2. A custom WebHelp Responsive publishing template, "qp_fix/WebHelp Responsive quoted-phrase fix.opt," which defines parameter webhelp.search.stop.words.exclude as the default list of all stopwords generated by the Lucene indexer.
3. A DITA-OT plugin to hook the whr-search-index-post extension point and:
- Strip stop words from the index-?.js files
- Write a stopwords.js file with the default content
- Copy the modified nwSearchFnt.js file to the correct location in the WebHelp output directory.
This approach is necessary because the offset values in the indexes generated by Lucene do not reflect intervening stop words. A more elegant solution would be to directly customize Lucene and correct that. If that were done, only the modified nwSearchFnt.js would be necessary.
-coryc
================================================================
I had to split this demo into several attachments. To set it up, download all four attachments, and extract dita_sources.7z, project-map.7z, and qp_fix.7z to the same directory. (com.synopsys.webhelp.search-index.zip is the plugin.)
To run the demo in Oxygen:
1. Integrate com.synopsys.webhelp.search-index.zip into DITA-OT
2. Open OPENME1.xpr
3. Open OPENME2.ditamap
4. Edit the Synopsys WebHelp transformation scenario and add qp_fix/WebHelp Responsive quoted-phrase fix.opt through Choose Custom Publishing Template
5. Apply the Synopsys WebHelp transformation to OPENME2.ditamap
To run the demo in Oxygen Publishing Engine:
1. Integrate com.synopsys.webhelp.search-index.zip into DITA-OT
2. Run the following command: dita --project project.xml
A search for the quoted phrase "layout new" returns one result.
A search for the unquoted phrase layout new returns six results.
com.synopsys.webhelp.search-index.zip
dita_sources.7z
project-map.7z
qp_fix.7z