tokenize sentences in a paragraph in Schematron

shudson310


Post by shudson310 »

I have a paragraph with multiple sentences of mixed content. I'd like to tokenize the paragraph into sentences and wrap each one in a <ph> element.

I've tried a few approaches using tokenize, but can't seem to get them to work.

<sqf:replace match="text()">
<xsl:copy>
<xsl:for-each-group select="text()" group-ending-with="text()[matches(., $SEnd)]">
<ph>
<xsl:apply-templates select="current-group()" />
</ph>
</xsl:for-each-group>
</xsl:copy>
</sqf:replace>
<sqf:delete match="ph"/>

For example:
<p>Hi, I am the first sentence. I am the <b>second sentence</b> with mixed content. I am the <i>third</i> sentence with mixed content.</p>

Ideally, I want to delete any sentence with <b> content, but need to leave the other sentences. My thought is to tokenize each sentence and wrap them in a <ph>, then find any <ph> containing a child <b> and delete it.

Any ideas?

Thanks,

--Scott
Scott Hudson
Staff Content Engineer
Site: docs.servicenow.com
tavy

Re: tokenize sentences in a paragraph in Schematron

Post by tavy »

Hi Scott,

You can use a regular expression that matches all the text up to the first ".", "!", or "?" character. I've created an example of a Schematron quick fix that replaces the paragraph content with new content in which each sentence is wrapped in a ph element.


<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <sch:pattern>
        <!-- A sentence: one non-whitespace character, then any text up to the first ".", "!" or "?" -->
        <sch:let name="regexp" value="'[^\s][^.!?]*[.!?]'"/>
        <sch:rule context="p">
            <sch:report test="text()[matches(., $regexp)]" sqf:fix="fix_id">Paragraph with multiple sentences</sch:report>

            <sqf:fix id="fix_id">
                <sqf:description>
                    <sqf:title>Wrap each sentence in a ph element.</sqf:title>
                </sqf:description>
                <!-- Replaces the whole paragraph; works on its string value, so inline markup is not kept -->
                <sqf:replace>
                    <p>
                        <xsl:analyze-string select="." regex="{$regexp}">
                            <xsl:matching-substring>
                                <ph>
                                    <xsl:value-of select="regex-group(0)"/>
                                </ph>
                            </xsl:matching-substring>
                        </xsl:analyze-string>
                    </p>
                </sqf:replace>
            </sqf:fix>
        </sch:rule>
    </sch:pattern>
</sch:schema>
Best Regards,
Octavian
Octavian Nadolu
<oXygen/> XML Editor
http://www.oxygenxml.com

Re: tokenize sentences in a paragraph in Schematron

Post by shudson310 »

Unfortunately, it looks like the markup in the <i> sentence is getting removed. We need to preserve the markup and delete only the sentence that has <b>.

Re: tokenize sentences in a paragraph in Schematron

Post by tavy »

Hi Scott,

It is a little more complicated if you want to preserve the markup in the paragraph: you need to use XSLT code to process the content. I am not an XSLT expert, but I can give you some hints.

From the quick fix you need to apply a template that will process the paragraph content.


<sqf:replace>
    <p>
        <xsl:apply-templates mode="wrapInPh"/>
    </p>
</sqf:replace>
The XSLT template with the mode "wrapInPh" must be added to the Schematron file at the top level, or you can put it in a separate XSLT stylesheet and include that from the Schematron file.
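
To make the placement concrete, here is a minimal skeleton, offered as a sketch rather than a tested schema. It assumes the editor accepts XSLT elements at the top level of the Schematron file, as described above. The rule, the fix id "wrapSentences", and the file name wrapInPh.xsl are hypothetical, and the template body is only a placeholder that copies content unchanged.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Option 1: the template lives in the schema itself, at the top level.
         Placeholder body: an identity copy in mode wrapInPh. -->
    <xsl:template match="node() | @*" mode="wrapInPh">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*" mode="wrapInPh"/>
        </xsl:copy>
    </xsl:template>

    <!-- Option 2 (use instead of option 1): keep the template in a separate
         stylesheet. The file name is hypothetical. -->
    <!-- <xsl:include href="wrapInPh.xsl"/> -->

    <sch:pattern>
        <!-- Hypothetical rule: flag paragraphs that contain a b element -->
        <sch:rule context="p">
            <sch:report test="b" sqf:fix="wrapSentences">Paragraph contains b content</sch:report>
            <sqf:fix id="wrapSentences">
                <sqf:description>
                    <sqf:title>Wrap each sentence in a ph element.</sqf:title>
                </sqf:description>
                <sqf:replace>
                    <p>
                        <!-- Hands the children of p to the template above -->
                        <xsl:apply-templates mode="wrapInPh"/>
                    </p>
                </sqf:replace>
            </sqf:fix>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

With the placeholder template, the fix rewrites the paragraph without changing it. The sentence handling would replace the identity copy.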

Here are some suggestions from Michael Kay on how to implement the processing template:
There are two approaches to the problem. Both involve breaking it into smaller problems.

The first approach is to convert the markup into text (for example, replace the element <b>first</b> with a text representation of its tags), then use text manipulation operations (xsl:analyze-string) to split it into sentences, and then reconstitute the markup within the sentences.

The second approach is to convert the text delimiters into markup (convert "." to <stop/>) and then use positional grouping techniques (typically <xsl:for-each-group group-ending-with="stop"/>) to convert the sentences into paragraphs.
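
The second approach could look roughly like the stylesheet below. This is an untested sketch under several assumptions: elements are in no namespace, sentence terminators appear only in text nodes that are direct children of p, and the mode name "markStops" is made up for illustration. It also differs from the snippet above in that the wrapInPh template matches the p element itself, so the quick fix would call <xsl:apply-templates select="." mode="wrapInPh"/> without writing its own <p> wrapper.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Step 1: in direct text children of p, emit a <stop/> marker
         after each run of text that ends in ".", "!" or "?". -->
    <xsl:template match="p/text()" mode="markStops">
        <xsl:analyze-string select="." regex="[^.!?]*[.!?]+">
            <xsl:matching-substring>
                <xsl:value-of select="."/>
                <stop/>
            </xsl:matching-substring>
            <!-- Text with no terminator yet (the sentence continues in a later node) -->
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

    <!-- Inline elements and other nodes are copied unchanged -->
    <xsl:template match="node()" mode="markStops">
        <xsl:copy-of select="."/>
    </xsl:template>

    <!-- Step 2: group the marked-up nodes into sentences, one group per <stop/> -->
    <xsl:template match="p" mode="wrapInPh">
        <xsl:variable name="marked">
            <xsl:apply-templates select="node()" mode="markStops"/>
        </xsl:variable>
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:for-each-group select="$marked/node()" group-ending-with="stop">
                <!-- Skip any sentence that contains a b element -->
                <xsl:if test="not(current-group()/descendant-or-self::b)">
                    <ph>
                        <!-- Copy the sentence without its marker -->
                        <xsl:copy-of select="current-group()[not(self::stop)]"/>
                    </ph>
                </xsl:if>
            </xsl:for-each-group>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

On the sample paragraph from the first post, this should keep the first and third sentences, each in its own ph with the <i> markup intact, and drop the second sentence. The space between sentences ends up at the start of the following ph, and a terminator inside an inline element (for example a period inside <i>) is not detected. Both would need extra handling in a real fix.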
Best Regards,
Octavian