Oxygen XML Forum

Posted: **Mon Sep 24, 2018 1:40 am**

I have a paragraph with multiple sentences of mixed content. I'd like to split the paragraph into sentences and wrap each one in a <ph> element.

I've tried a few approaches using tokenize, but can't seem to get them to work.

<sqf:replace match="text()">
    <xsl:copy>
        <xsl:for-each-group select="text()" group-ending-with="text()[matches(., $SEnd)]">
            <ph>
                <xsl:apply-templates select="current-group()" />
            </ph>
        </xsl:for-each-group>
    </xsl:copy>
</sqf:replace>
<sqf:delete match="ph"/>

For example:
Hi, I am the first sentence. I am the second sentence with mixed content. I am the third sentence with mixed content.

Ideally, I want to delete any sentence with content, but need to leave the other sentences. My thought is to wrap each sentence in a <ph>, then find any <ph> containing a child element and delete it.

Any ideas?

Thanks,

--Scott

Posted: **Mon Sep 24, 2018 11:11 am**

Hi Scott,

You can use a regular expression that matches all the text up to and including the first ".", "!", or "?" character. I've created an example of a Schematron quick fix that replaces the paragraph content with new content in which each sentence is wrapped in a ph element.

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <sch:pattern>
        <sch:let name="regexp" value="'[^\s+][^.!?]*[.!?]'"/>
        <sch:rule context="p">
            <sch:report test="text()[matches(., $regexp)]" sqf:fix="fix_id">Paragraph with multiple sentences</sch:report>

            <sqf:fix id="fix_id">
                <sqf:description>
                    <sqf:title>Wrap each sentence in a ph element.</sqf:title>
                </sqf:description>
                <sqf:replace>
                    <p>
                        <xsl:analyze-string select="." regex="{$regexp}">
                            <xsl:matching-substring>
                                <ph>
                                    <xsl:value-of select="regex-group(0)"/>
                                </ph>
                            </xsl:matching-substring>
                        </xsl:analyze-string>
                    </p>
                </sqf:replace>
            </sqf:fix>
        </sch:rule>
    </sch:pattern>
</sch:schema>
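
To illustrate what this fix does, here is a small before/after sketch. The input paragraph and its b element are made up for the example. Because xsl:analyze-string is applied to the string value of the paragraph (select="."), only the text is carried over, and the whitespace between sentences is dropped since there is no xsl:non-matching-substring branch. The result should look roughly like this (the editor may indent it differently):

```xml
<!-- Before applying the quick fix (hypothetical input) -->
<p>I am the first sentence. I am the second sentence with <b>mixed</b> content.</p>

<!-- After applying the quick fix: each sentence is wrapped in ph,
     but the inline b element is gone because only the text was matched -->
<p><ph>I am the first sentence.</ph><ph>I am the second sentence with mixed content.</ph></p>
```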

Best Regards,
Octavian

Posted: **Mon Sep 24, 2018 5:54 pm**

Unfortunately, it looks like the markup in the sentence is getting removed. We need to preserve the markup and delete only the sentence that has .

Posted: **Tue Sep 25, 2018 2:00 pm**

Hi Scott,

It is a little more complicated if you want to preserve the markup in the paragraph; you need to use XSLT code to process the content. I am not an XSLT expert, but I can give you some hints.

From the quick fix, you need to apply a template that processes the paragraph content.

<sqf:replace>
    <p>
        <xsl:apply-templates mode="wrapInPh"/>
    </p>
</sqf:replace>

The XSLT template with the mode "wrapInPh" must be added at the top level of the Schematron file, or you can create a separate XSLT stylesheet and include it from the Schematron file.
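
As a rough sketch of how the pieces could fit together, the skeleton below puts a template in the "wrapInPh" mode next to the pattern. The template is only a placeholder that copies nodes unchanged (so the inline markup survives); the actual sentence-wrapping logic still has to be written. The file name in the comment is hypothetical.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Placeholder template in the wrapInPh mode: copies every node as is,
         so the markup inside the paragraph is preserved. The sentence
         handling would replace or extend this template. -->
    <xsl:template match="node() | @*" mode="wrapInPh">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*" mode="wrapInPh"/>
        </xsl:copy>
    </xsl:template>

    <!-- Alternative: keep the templates in a separate stylesheet, for example
         <xsl:include href="wrapInPh.xsl"/> (hypothetical file name). -->

    <sch:pattern>
        <sch:let name="regexp" value="'[^\s+][^.!?]*[.!?]'"/>
        <sch:rule context="p">
            <sch:report test="text()[matches(., $regexp)]" sqf:fix="fix_id">Paragraph with multiple sentences</sch:report>

            <sqf:fix id="fix_id">
                <sqf:description>
                    <sqf:title>Wrap each sentence in a ph element.</sqf:title>
                </sqf:description>
                <sqf:replace>
                    <p>
                        <!-- Process the children of p with the templates above -->
                        <xsl:apply-templates mode="wrapInPh"/>
                    </p>
                </sqf:replace>
            </sqf:fix>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```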

Here are some suggestions made by Michael Kay for implementing the processing template.

There are two approaches to the problem. Both involve breaking it into smaller problems.

The first approach is to convert the markup into text (for example, replace <b>first</b> by [b]first[/b]), then use text manipulation operations (xsl:analyze-string) to split it into sentences, and then reconstitute the markup within the sentences.

The second approach is to convert the text delimiters into markup (convert "." to <stop/>) and then use positional grouping techniques (typically <xsl:for-each-group group-ending-with="stop"/>) to convert the sentences into paragraphs.
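
A minimal sketch of the second approach, written as a standalone XSLT 2.0 stylesheet so it can be tried outside Schematron. It is not a tested solution for your documents and makes two simplifying assumptions: sentence terminators (".", "!", "?") only occur in text that is a direct child of p, and every terminator ends a sentence (so abbreviations such as "e.g." would be split too). The mode name "mark" and the variable name "marked" are made up for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Identity template: copy everything that is not handled below. -->
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="p">
        <!-- Pass 1: copy the children of p into a temporary tree, adding a
             stop element after each sentence terminator. -->
        <xsl:variable name="marked">
            <xsl:apply-templates select="node()" mode="mark"/>
        </xsl:variable>
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <!-- Pass 2: start a new group after every stop marker. -->
            <xsl:for-each-group select="$marked/node()" group-ending-with="stop">
                <xsl:choose>
                    <!-- A group that ends with a marker is a complete sentence:
                         wrap it in ph and leave the marker out. -->
                    <xsl:when test="current-group()[self::stop]">
                        <ph>
                            <xsl:copy-of select="current-group()[not(self::stop)]"/>
                        </ph>
                    </xsl:when>
                    <!-- Trailing content without a terminator is copied as is. -->
                    <xsl:otherwise>
                        <xsl:copy-of select="current-group()"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:for-each-group>
        </xsl:copy>
    </xsl:template>

    <!-- Text directly inside p: emit the text and a stop marker after
         each terminator character. -->
    <xsl:template match="text()" mode="mark">
        <xsl:analyze-string select="." regex="[.!?]">
            <xsl:matching-substring>
                <xsl:value-of select="."/>
                <stop/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

    <!-- Inline elements, comments and processing instructions are copied
         unchanged, so the markup stays inside its sentence. -->
    <xsl:template match="* | comment() | processing-instruction()" mode="mark">
        <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>
```

With this sketch, the whitespace that follows a terminator ends up at the start of the next ph, so you may want to trim it. Since each sentence is available as current-group(), that is also the place where you could test whether a sentence contains an element and skip it instead of wrapping it.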

Best Regards,
Octavian

tokenize sentences in a paragraph in Schematron
