Tokenization and scoring.properties

Post by **galanohan** » Fri Feb 02, 2024 12:56 pm

Hi,

I've customized the scoring.properties file under \Oxygen XML Editor 26\frameworks\dita\DITA-OT\plugins\com.oxygenxml.webhelp.responsive\indexer to increase the search weight of content in certain HTML elements, such as title, meta_keywords, and meta_indexterms.
Something like this:

h1 = 999
h2 = 40
h3 = 30
h4 = 20
h5 = 20
h6 = 20
b = 20
strong = 5
em = 5
i = 5
u = 5
p = 50
div.toc = 100
title = 1099
div.ignore = ignored
meta_keywords = 999
meta_indexterms = 999
meta_description = 1
shortdesc = 10

However, searching for a topic by only part of its title does not work. For example, we have a function named "isIndexedMatrix", and its content is as follows:

image.png

In the WebHelp output, if somebody searches for "IndexedMatrix", no result is returned. They have to enter the full name "isIndexedMatrix" to get the search result. New users who can't remember the full name of the function might enter "Index", "Matrix", or "IndexedMatrix", but none of these returns the "isIndexedMatrix" page in the search results.
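The behavior described above is what one would expect if the search only matches whole indexed words. A minimal Python sketch of that assumption follows; it is a toy model, not Oxygen's actual indexer, and the page text and the `build_index` and `search` helpers are hypothetical.

```python
# Toy model of whole-word search. This is an assumption about the observed
# behavior, not a description of the real WebHelp indexer.

def build_index(pages):
    """Map each lowercased, whitespace-separated word to the pages containing it."""
    index = {}
    for page, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(page)
    return index


def search(index, query):
    """Return the pages that contain the query as a whole indexed word."""
    return sorted(index.get(query.lower(), set()))


# Hypothetical page content, for illustration only.
pages = {"isIndexedMatrix.html": "isIndexedMatrix syntax description examples"}
index = build_index(pages)

print(search(index, "isIndexedMatrix"))  # ['isIndexedMatrix.html']
print(search(index, "IndexedMatrix"))    # []
print(search(index, "Matrix"))           # []
```

Under this model, a partial name can only match if the index also contains the sub-words of the function name.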

Of course, I could add more meta_keyword and meta_indexterm entries for this topic in the map, but we have over 1700 functions and over 300 other topics. Adding topicmeta to 2000+ topics manually to increase their search weight would drive us crazy.

So, is it possible to add "code" or "codeph" to the scoring.properties file and assign it a large value to increase its weight? Every function topic starts with a <codeph/> element in the syntax section.

Or, which I think might be more efficient, could tokenization be used? Tokenization is the process of breaking text into individual units, such as words or sentences. For example, for HTML pages generated from Markdown and reStructuredText with Sphinx, there are Python libraries, such as markdown and recommonmark, that help with parsing and processing those documents.

Does Oxygen XML Editor support tokenization?
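To make the request concrete, here is a minimal sketch of the kind of tokenization that would help with camelCase function names. It is written in Python purely for illustration and is not an Oxygen feature; the `camel_parts` and `search_terms` helpers are hypothetical names.

```python
import re

# Split where a lowercase letter or digit is followed by an uppercase letter,
# and before the last capital of an acronym run ("XMLFile" -> "XML", "File").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def camel_parts(name):
    """Split a camelCase identifier into its component words.

    Relies on re.split accepting zero-width matches (Python 3.7 or later).
    """
    return [part for part in _CAMEL_BOUNDARY.split(name) if part]


def search_terms(name):
    """Return lowercased terms for every contiguous run of component words."""
    parts = camel_parts(name)
    terms = []
    for start in range(len(parts)):
        for end in range(start + 1, len(parts) + 1):
            term = "".join(parts[start:end]).lower()
            if term not in terms:
                terms.append(term)
    return terms


print(camel_parts("isIndexedMatrix"))
# ['is', 'Indexed', 'Matrix']
print(search_terms("isIndexedMatrix"))
# ['is', 'isindexed', 'isindexedmatrix', 'indexed', 'indexedmatrix', 'matrix']
```

With terms like these in the index, "Matrix" and "IndexedMatrix" would match as whole words. A query such as "Index" would still need prefix matching or stemming, because the generated term is "indexed".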

Post by **galanohan** » Fri Feb 02, 2024 1:29 pm

I searched the forum and found the following thread: post47728.html?hilit=tokenization#p47728 and the relevant mailing list archive: https://www.oxygenxml.com/archives/xsl- ... 00939.html

Is customizing a similar template the right way to do tokenization? If so, how do I specify it in the .opt?

I used ChatGPT to write a tokenization template like this:

------------------------------------------------------
Sure, here's an example of a tokenization template in XML and an XSL stylesheet to break a string into sub-strings:

XML template (input.xml):
```xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <searchWord>Hello World</searchWord>
</root>
```

XSL stylesheet (tokenize.xsl):
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Entry point: wrap the tokens of root/searchWord in <tokenizedWords>. -->
  <xsl:template match="/">
    <tokenizedWords>
      <xsl:call-template name="tokenizeString">
        <xsl:with-param name="input" select="root/searchWord"/>
      </xsl:call-template>
    </tokenizedWords>
  </xsl:template>

  <!-- Recursive splitter: emits one <word> per delimiter-separated piece. -->
  <xsl:template name="tokenizeString">
    <xsl:param name="input"/>
    <xsl:param name="delimiter" select="' '"/>

    <xsl:choose>
      <!-- A delimiter remains: emit the text before it, recurse on the rest. -->
      <xsl:when test="contains($input, $delimiter)">
        <word>
          <xsl:value-of select="substring-before($input, $delimiter)"/>
        </word>
        <xsl:call-template name="tokenizeString">
          <xsl:with-param name="input" select="substring-after($input, $delimiter)"/>
          <xsl:with-param name="delimiter" select="$delimiter"/>
        </xsl:call-template>
      </xsl:when>
      <!-- No delimiter left: the remaining text is the last token. -->
      <xsl:otherwise>
        <word>
          <xsl:value-of select="$input"/>
        </word>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

To use this XSL stylesheet to tokenize the search word "Hello World", you can use an XSLT processor to apply the transformation. For example, using the `xsltproc` command line tool:

```bash
xsltproc tokenize.xsl input.xml
```

This will produce the following output:
```xml
<?xml version="1.0"?>
<tokenizedWords>
  <word>Hello</word>
  <word>World</word>
</tokenizedWords>
```

This XSL stylesheet tokenizes the input string by breaking it into sub-strings based on the space character. You can modify the delimiter parameter in the `tokenizeString` template to use a different delimiter if needed.

-----------------------

Is this how it works?
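One thing worth checking about the template above: it only splits on the delimiter character, so a name without spaces comes back as a single token. The Python sketch below mirrors the recursive logic of `tokenizeString` to show this; the function name `tokenize_string` is a hypothetical stand-in, and this is an illustration rather than anything Oxygen runs.

```python
def tokenize_string(text, delimiter=" "):
    """Mirror of the tokenizeString template: emit the text before the first
    delimiter, then recurse on the text after it."""
    if delimiter in text:
        head, _, tail = text.partition(delimiter)
        return [head] + tokenize_string(tail, delimiter)
    # No delimiter left: the remaining text is the last token.
    return [text]


print(tokenize_string("Hello World"))      # ['Hello', 'World']
print(tokenize_string("isIndexedMatrix"))  # ['isIndexedMatrix']
```

So, as written, the template would not by itself make "IndexedMatrix" or "Matrix" findable; the split would have to happen at the uppercase letters inside the name rather than at spaces.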
