XPath version in XSLT 3.0? Weird regex error

ttasovac
Posts: 87
Joined: Fri Dec 19, 2003 6:02 pm

Post by ttasovac »

Hi.

I need to use various Unicode RegEx classes such as `\p{P}` in my XSLT scripts. I can test them in Oxygen's XPath builder and they work as expected with XPath 2.0 and XPath 3.1.

But when I put the same RegEx class inside an XSLT stylesheet (version 3.0), transforming with Saxon, for instance:

Code:

<xsl:template match="P/B[1][matches(., '\p{P}\s*$')]">
</xsl:template>
I'm getting this error:

Code:

Syntax error at char 8 in regular expression: Expected '{' after \112
with additional info pointing to http://www.w3.org/TR/2005/WD-xpath-func ... RRFORX0002 which seems to indicate some kind of XPath incompatibility. But that doesn't make much sense to me because XSLT 3.0 should be ok with XPath 3.1 and, as I said, this definitely works in Oxygen's XPath builder.

What am I missing here? I'll be most grateful for any tips you may have.

All best,
Toma
teo
Posts: 63
Joined: Wed Aug 30, 2017 3:56 pm

Re: XPath version in XSLT 3.0? Weird regex error

Post by teo »

Hi Toma,

I tested on the latest version of Oxygen (26.1) and did not reproduce the issue.
Additionally, I asked a colleague to test on his workstation as well. Likewise, the transformation went smoothly.
Maybe there's a small detail we're missing here, but I can't figure it out...

Best regards,
Teo
Teodor Timplaru
<oXygen/> XML Editor
http://www.oxygenxml.com
ttasovac
Posts: 87
Joined: Fri Dec 19, 2003 6:02 pm

Re: XPath version in XSLT 3.0? Weird regex error

Post by ttasovac »

Thank you so much for looking into this, Teo. I have since updated to macOS 15 and the latest oXygen, and I am, indeed, no longer having problems with `\p{P}`. But I'm still having issues with Unicode scripts and Unicode blocks in XPath. So I have created an oXygen project that demonstrates the issue: https://github.com/ttasovac/unicode-regex-in-oxygen.

You can view the rest of this post in the README on GitHub — it will be more readable.

Unicode assigns code points to blocks and scripts:
  • The regular expression `\p{Cyrillic}+` should match characters that are assigned to the Cyrillic script. Some regex flavors require the `\p{IsCyrillic}+` notation. The Cyrillic _script_ should match all the Cyrillic _blocks_.
  • `\p{InCyrillic}+`, `\p{InCyrillicExtended-A}`, `\p{InCyrillicExtended-B}` should each match the corresponding Cyrillic block.
I'm puzzled by the following inconsistencies in oXygen.

## Script matching

### oXygen search

In oXygen Search, I'm getting the expected results, with `\p{IsCyrillic}+` matching Cyrillic script characters from all three Cyrillic blocks: Cyrillic, Cyrillic Extended-A and Cyrillic Extended-B.

**oXygen search works as expected.**

### XPath matching

In XSLT, `\p{IsCyrillic}+` will match Cyrillic characters only in the Cyrillic block, but will _not_ match those that are in Cyrillic Extended-A and Cyrillic Extended-B.

**XPath seems to match the block instead of the script.**

## Block matching

### oXygen search

Works as expected. Blocks are correctly matched.

### XPath matching

Block matching doesn't work at all because the character category is not recognized.

How can I make sense of this?
Mircea
Posts: 136
Joined: Tue Mar 25, 2003 11:21 am

Re: XPath version in XSLT 3.0? Weird regex error

Post by Mircea »

Hi Toma,
In XPath/XSLT 3.0, the matches() function uses the XML Schema regular expression syntax, in which \p{IsCyrillic} is a block escape: it matches the basic Cyrillic block (U+0400–U+04FF), not the Cyrillic script. That is why it does not match characters from Cyrillic Extended-A or Cyrillic Extended-B. The \p{InCyrillicExtended-A} and \p{InCyrillicExtended-B} notation is not recognized in XPath regular expressions, so you cannot use it there.
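
A minimal sketch for checking how \p{IsCyrillic} behaves in your own processor. It assumes the stylesheet is started from its default initial template (for example with Saxon's -it option), and the two sample characters are chosen only for illustration. Going by Toma's observations, the first call should report true and the second false, but this has not been run against the sample project.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

    <xsl:output method="text"/>

    <!-- Default entry point in XSLT 3.0; no source document is needed -->
    <xsl:template name="xsl:initial-template">
        <!-- U+0416 lies in the basic Cyrillic block (U+0400 to U+04FF) -->
        <xsl:value-of select="matches('&#x0416;', '\p{IsCyrillic}')"/>
        <xsl:text>&#10;</xsl:text>
        <!-- U+2DE0 lies in Cyrillic Extended-A, outside the basic block -->
        <xsl:value-of select="matches('&#x2DE0;', '\p{IsCyrillic}')"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```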

To check characters from the Cyrillic Extended-A and Cyrillic Extended-B blocks, you will need to use explicit ranges of Unicode code points.
Cyrillic Extended-A has Unicode code points in the range: U+2DE0–U+2DFF.
Cyrillic Extended-B has Unicode code points in the range: U+A640–U+A69F.

Instead of using regular expressions, we use the string-to-codepoints() function to convert each character in the text into its Unicode code point. We then check whether any of these code points falls within one of the following ranges:

Cyrillic Extended-A: Code points are between U+2DE0 (11744) and U+2DFF (11775).
Cyrillic Extended-B: Code points are between U+A640 (42560) and U+A69F (42655).

The proper XSLT for your needs is below:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">
    
    <xsl:mode on-no-match="shallow-copy"/>
    
    <!-- Cyrillic block -->
    <xsl:template match="B[matches(., '\p{IsCyrillic}+')]" priority="4">
        <B>
            <xsl:text>matched Cyrillic: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <xsl:template match="B[not(matches(., '\p{IsCyrillic}+'))]" priority="1">
        <B>
            <xsl:text>not matched Cyrillic: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <!-- Cyrillic Extended-A block (U+2DE0–U+2DFF) -->
    <!-- string-to-codepoints() returns a sequence, so each code point is tested with
         "some ... satisfies"; applying ge/le to a multi-character string raises a type error -->
    <xsl:template match="B[some $c in string-to-codepoints(.) satisfies ($c ge 11744 and $c le 11775)]" priority="3">
        <B>
            <xsl:text>matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <xsl:template match="B[not(some $c in string-to-codepoints(.) satisfies ($c ge 11744 and $c le 11775))]" priority="2">
        <B>
            <xsl:text>not matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <!-- Cyrillic Extended-B block (U+A640–U+A69F, decimal 42560 to 42655) -->
    <xsl:template match="B[some $c in string-to-codepoints(.) satisfies ($c ge 42560 and $c le 42655)]" priority="3">
        <B>
            <xsl:text>matched B: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
    
    <xsl:template match="B[not(some $c in string-to-codepoints(.) satisfies ($c ge 42560 and $c le 42655))]" priority="2">
        <B>
            <xsl:text>not matched B: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
</xsl:stylesheet>
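
If you would rather stay with matches(), the same two ranges can probably also be written as explicit character classes. The following is only a sketch under that assumption and has not been run against the sample project. It relies on the XML parser resolving the character references before the regular expression is compiled; the B element name and the output labels are carried over from the stylesheet above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Cyrillic Extended-A block (U+2DE0 to U+2DFF) as an explicit range:
         matches any B that contains at least one character from the block -->
    <xsl:template match="B[matches(., '[&#x2DE0;-&#x2DFF;]')]" priority="3">
        <B>
            <xsl:text>matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>

    <!-- Cyrillic Extended-B block (U+A640 to U+A69F) as an explicit range;
         the lower priority means Extended-A wins when a B contains both -->
    <xsl:template match="B[matches(., '[&#xA640;-&#xA69F;]')]" priority="2">
        <B>
            <xsl:text>matched B: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
</xsl:stylesheet>
```

Writing the bounds in hexadecimal keeps them identical to the U+ notation, which avoids converting the ranges to decimal by hand.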
Mircea Enachescu
<oXygen> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
ttasovac
Posts: 87
Joined: Fri Dec 19, 2003 6:02 pm

Re: XPath version in XSLT 3.0? Weird regex error

Post by ttasovac »

Thanks a lot, Mircea!