XPath version in XSLT 3.0? Weird regex error

Post by **ttasovac** » Sat Sep 14, 2024 9:16 pm

Hi.

I need to use various Unicode RegEx classes such as

 \p{P}

in my XSLT scripts. I can test them in Oxygen's XPath builder and they work as expected with XPath 2.0 and XPath 3.1.

But when I put the same RegEx class inside an XSLT stylesheet (version 3.0), transforming with Saxon, for instance:

<xsl:template match="P/B[1][matches(., '\p{P}\s*$')]">
</xsl:template>

I'm getting this error:

Syntax error at char 8 in regular expression: Expected '{' after \112

with additional info pointing to http://www.w3.org/TR/2005/WD-xpath-func ... RRFORX0002 which seems to indicate some kind of XPath incompatibility. But that doesn't make much sense to me because XSLT 3.0 should be ok with XPath 3.1 and, as I said, this definitely works in Oxygen's XPath builder.

What am I missing here? I'll be most grateful for any tips you may have.

All best,
Toma

Post by **teo** » Tue Sep 17, 2024 4:53 pm

Hi Toma,

I tested on the latest version of Oxygen (26.1) and did not reproduce the issue.
Additionally, I asked a colleague to test on his workstation as well. Likewise, the transformation went smoothly.
Maybe there's a small detail we're missing here, but I can't figure it out...

Best regards,
Teo

Post by **ttasovac** » Sun Sep 22, 2024 9:14 am

Thank you so much for looking into this, Teo. I have since updated to macOS 15 and the latest oXygen, and I am, indeed, no longer having problems with `\p{P}`. But I'm still having issues with Unicode scripts and Unicode blocks in XPath, so I have created an oXygen project that demonstrates the issue: https://github.com/ttasovac/unicode-regex-in-oxygen.

You can view the rest of this post in the README on GitHub — it will be more readable.

Unicode assigns code points to blocks and scripts:

The regular expression `\p{Cyrillic}+` should match characters that are assigned to the Cyrillic script. Some regex flavors require the `\p{IsCyrillic}+` notation. The Cyrillic _script_ should match all the Cyrillic _blocks_.

`\p{InCyrillic}+`, `\p{InCyrillicExtended-A}`, `\p{InCyrillicExtended-B}` should each match the corresponding Cyrillic block.

I'm puzzled by the following inconsistencies in oXygen.

## Script matching

### oXygen search

In oXygen Search, I'm getting the expected results, with `\p{IsCyrillic}+` matching Cyrillic script characters from all three Cyrillic blocks: Cyrillic, Cyrillic Extended-A and Cyrillic Extended-B.

**oXygen search works as expected.**

### XPath matching

In XSLT, `\p{IsCyrillic}+` will match Cyrillic characters only in the Cyrillic block, but will _not_ match those that are in Cyrillic Extended-A and Cyrillic Extended-B.

**XPath seems to match the block instead of the script.**

## Block matching

### oXygen search

Works as expected. Blocks are correctly matched.

### XPath matching

Block matching doesn't work at all because the character category is not recognized.

How can I make sense of this?

Post by **Mircea** » Tue Sep 24, 2024 2:36 pm

Hi Toma,
In XPath/XSLT 3.0, the matches() function uses Unicode-aware regular expressions, but Unicode blocks are specified with the Is prefix, as in \p{IsCyrillic} for the basic Cyrillic block. Unfortunately, for Cyrillic Extended-A and Cyrillic Extended-B, you cannot use \p{InCyrillicExtended-A} or \p{InCyrillicExtended-B}, as these are not recognized character categories in XPath.

To check characters from the Cyrillic Extended-A and Cyrillic Extended-B blocks, you will need to use explicit ranges of Unicode code points.
Cyrillic Extended-A has Unicode code points in the range: U+2DE0–U+2DFF.
Cyrillic Extended-B has Unicode code points in the range: U+A640–U+A69F.

Instead of using regular expressions, we use the string-to-codepoints() function to convert each character in the text into its corresponding Unicode code point. We then check whether any of these code points falls within one of the following ranges:

Cyrillic Extended-A: Code points are between U+2DE0 (11744) and U+2DFF (11775).
Cyrillic Extended-B: Code points are between U+A640 (42560) and U+A69F (42623).
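
As a minimal sketch of the code point test described above, the range check can be wrapped in a reusable stylesheet function. The `my` prefix, its namespace URI, and the function name `my:has-codepoint-in-range` are hypothetical names chosen for illustration. The `some ... satisfies` expression tests each code point individually, so it also works for text longer than one character.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:local-functions"
    exclude-result-prefixes="xs my"
    version="3.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Hypothetical helper: true if at least one character of $text has a
         code point between $first and $last (inclusive). -->
    <xsl:function name="my:has-codepoint-in-range" as="xs:boolean">
        <xsl:param name="text" as="xs:string?"/>
        <xsl:param name="first" as="xs:integer"/>
        <xsl:param name="last" as="xs:integer"/>
        <xsl:sequence select="some $c in string-to-codepoints($text)
                              satisfies ($c ge $first and $c le $last)"/>
    </xsl:function>

    <!-- B elements containing a Cyrillic Extended-A character (U+2DE0 to U+2DFF) -->
    <xsl:template match="B[my:has-codepoint-in-range(., 11744, 11775)]">
        <B>
            <xsl:text>matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
</xsl:stylesheet>
```

Calling the same function with 42560 and 42623 would test for Cyrillic Extended-B.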

The proper XSLT for your needs is below:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">
    
    <xsl:mode on-no-match="shallow-copy"/>
    
    <!-- Cyrillic block -->
    <xsl:template match="B[matches(., '\p{IsCyrillic}+')]" priority="4">
        <B>
            <xsl:text>matched Cyrillic: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <xsl:template match="B[not(matches(., '\p{IsCyrillic}+'))]" priority="1">
        <B>
            <xsl:text>not matched Cyrillic: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <!-- Cyrillic Extended-A block (U+2DE0–U+2DFF) -->
    <!-- "ge"/"le" compare single values, so each code point is tested individually -->
    <xsl:template match="B[some $c in string-to-codepoints(.) satisfies ($c ge 11744 and $c le 11775)]" priority="3">
        <B>
            <xsl:text>matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <xsl:template match="B[not(some $c in string-to-codepoints(.) satisfies ($c ge 11744 and $c le 11775))]" priority="2">
        <B>
            <xsl:text>not matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <!-- Cyrillic Extended-B block (U+A640–U+A69F) -->
    <!-- "ge"/"le" compare single values, so each code point is tested individually -->
    <xsl:template match="B[some $c in string-to-codepoints(.) satisfies ($c ge 42560 and $c le 42623)]" priority="3">
        <B>
            <xsl:text>matched B: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <xsl:template match="B[not(some $c in string-to-codepoints(.) satisfies ($c ge 42560 and $c le 42623))]" priority="2">
        <B>
            <xsl:text>not matched B: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
</xsl:stylesheet>
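
A different approach from the stylesheet above, offered as a sketch only: the same block ranges can also be written directly in the regular expression as a character-class range, with XML character references for the two endpoints. The XML parser resolves the references before the XPath expression is compiled, so matches() sees an ordinary range such as `[X-Y]`. The priorities below are an assumption made for this example, so that an element containing characters from both blocks is reported as Extended-A.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Cyrillic Extended-A block (U+2DE0 to U+2DFF) as a regex range -->
    <xsl:template match="B[matches(., '[&#x2DE0;-&#x2DFF;]')]" priority="2">
        <B>
            <xsl:text>matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>

    <!-- Cyrillic Extended-B block (U+A640 to U+A69F) as a regex range -->
    <xsl:template match="B[matches(., '[&#xA640;-&#xA69F;]')]" priority="1">
        <B>
            <xsl:text>matched B: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
</xsl:stylesheet>
```

B elements that match neither template are copied unchanged by the shallow-copy mode.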

Post by **ttasovac** » Sun Sep 29, 2024 3:49 pm

Thanks a lot, Mircea!
