tokenize sentences in a paragraph in Schematron

Posted: Mon Sep 24, 2018 1:40 am
by shudson310
I have a paragraph with multiple sentences of mixed content. I'd like to tokenize each sentence and wrap them in a <ph> element.

I've tried a few approaches using tokenize, but can't seem to get them to work.

<sqf:replace match="text()">
<xsl:copy>
<xsl:for-each-group select="text()" group-ending-with="text()[matches(., $SEnd)]">
<ph>
<xsl:apply-templates select="current-group()" />
</ph>
</xsl:for-each-group>
</xsl:copy>
</sqf:replace>
<sqf:delete match="ph"/>

For example:
<p>Hi, I am the first sentence. I am the <b>second sentence</b> with mixed content. I am the <i>third</i> sentence with mixed content.</p>

Ideally, I want to delete any sentence with <b> content, but need to leave the other sentences. My thought is to tokenize each sentence and wrap them in a <ph>, then find any <ph> containing a child <b> and delete it.

Any ideas?

Thanks,

--Scott

Re: tokenize sentences in a paragraph in Schematron

Posted: Mon Sep 24, 2018 11:11 am
by tavy
Hi Scott,

You can use a regular expression that matches all the text up to the next ".", "!", or "?" character. I've created an example of a Schematron quick fix that replaces the paragraph content with new content in which each sentence is wrapped in a ph element.

Code:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<sch:pattern>
<sch:let name="regexp" value="'[^\s][^.!?]*[.!?]'"/>
<sch:rule context="p">
<sch:report test="text()[matches(., $regexp)]" sqf:fix="fix_id"> Paragraph with multiple
sentences </sch:report>

<sqf:fix id="fix_id">
<sqf:description>
<sqf:title>Wrap each sentence in a ph element.</sqf:title>
</sqf:description>
<sqf:replace>
<p>
<xsl:analyze-string select="." regex="{$regexp}">
<xsl:matching-substring>
<ph>
<xsl:value-of select="regex-group(0)"/>
</ph>
</xsl:matching-substring>
</xsl:analyze-string>
</p>
</sqf:replace>
</sqf:fix>
</sch:rule>
</sch:pattern>
</sch:schema>
Best Regards,
Octavian

Re: tokenize sentences in a paragraph in Schematron

Posted: Mon Sep 24, 2018 5:54 pm
by shudson310
Unfortunately, it looks like the markup in the <i> sentence is getting removed. We need to preserve the markup and delete only the sentence that has <b>.

Re: tokenize sentences in a paragraph in Schematron

Posted: Tue Sep 25, 2018 2:00 pm
by tavy
Hi Scott,

It is a little more complicated if you want to preserve the markup in the paragraph: you need to use XSLT code to process the content. I am not an XSLT expert, but I can give you some hints.

From the quick fix you need to apply a template that will process the paragraph content.

Code:

 <sqf:replace>
<p>
<xsl:apply-templates mode="wrapInPh"/>
</p>
</sqf:replace>
The XSLT template with the mode "wrapInPh" must be added at the top level of the Schematron file, or you can create a separate XSLT stylesheet and include it from the Schematron file.

Here you can find some suggestions made by Michael Kay for implementing the processing template.
There are two approaches to the problem. Both involve breaking it into smaller problems.

The first approach is to convert the markup into text (for example replace <b>first</b> by first), then use text manipulation operations (xsl:analyze-string) to split it into sentences, and then reconstitute the markup within the sentences.

The second approach is to convert the text delimiters into markup (convert "." to <stop/>) and then use positional grouping techniques (typically <xsl:for-each-group group-ending-with="stop"/>) to convert the sentences into paragraphs.
Best Regards,
Octavian
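
For anyone landing on this thread later, the second approach quoted above could be sketched roughly as follows. This is only a minimal sketch under several assumptions, not a tested solution for every document: sentence-ending punctuation is assumed to occur only in text nodes that are direct children of p, abbreviations such as "e.g." are not handled, the <stop/> element and the "markStops" mode are hypothetical names introduced here, and the quick fix is assumed to call the template on the paragraph itself, i.e. <p><xsl:apply-templates select="." mode="wrapInPh"/></p>, rather than on its children.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Called on the p element; emits one ph per sentence. -->
  <xsl:template match="p" mode="wrapInPh">
    <!-- Pass 1: build a temporary tree in which every sentence end
         is followed by a (hypothetical) <stop/> marker element. -->
    <xsl:variable name="marked">
      <xsl:apply-templates select="node()" mode="markStops"/>
    </xsl:variable>
    <!-- Pass 2: each run of nodes ending with <stop/> is one sentence. -->
    <xsl:for-each-group select="$marked/node()" group-ending-with="stop">
      <xsl:variable name="sentence" select="current-group()[not(self::stop)]"/>
      <!-- Skip whitespace-only groups, and drop sentences that contain
           a b element (remove the first condition to keep them). -->
      <xsl:if test="not($sentence/descendant-or-self::b)
                    and normalize-space(string-join($sentence, '')) != ''">
        <ph>
          <xsl:copy-of select="$sentence"/>
        </ph>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Text directly inside p: keep the text and insert <stop/>
       after each run of sentence-ending punctuation. -->
  <xsl:template match="p/text()" mode="markStops">
    <xsl:analyze-string select="." regex="[.!?]+">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <stop/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Everything else (b, i, comments, ...) is copied unchanged,
       so inline markup survives inside the ph elements. -->
  <xsl:template match="node()" mode="markStops">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

With the sample paragraph from the first post, this should leave the first and third sentences wrapped in ph, keep the <i> markup, and drop the sentence containing <b>. The whitespace between sentences ends up at the start of the following ph, and attributes on p are not copied, so both would need extra handling in real content.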