XPath version in XSLT 3.0? Weird regex error
Hi.
I need to use various Unicode RegEx classes such as
Code: Select all
\p{P}
in my XSLT scripts. I can test them in Oxygen's XPath builder and they work as expected with XPath 2.0 and XPath 3.1.
But when I put the same RegEx class inside an XSLT stylesheet (version 3.0), transforming with Saxon, for instance:
Code: Select all
<xsl:template match="P/B[1][matches(., '\p{P}\s*$')]">
</xsl:template>
I'm getting this error:
Code: Select all
Syntax error at char 8 in regular expression: Expected '{' after \112
with additional info pointing to http://www.w3.org/TR/2005/WD-xpath-func ... RRFORX0002, which seems to indicate some kind of XPath incompatibility. But that doesn't make much sense to me because XSLT 3.0 should be OK with XPath 3.1 and, as I said, this definitely works in Oxygen's XPath builder.
What am I missing here? I'll be most grateful for any tips you may have.
All best,
Toma
Re: XPath version in XSLT 3.0? Weird regex error
Hi Toma,
I tested on the latest version of Oxygen (26.1) and did not reproduce the issue.
Additionally, I asked a colleague to test on his workstation as well. Likewise, the transformation went smoothly.
Maybe there's a small detail we're missing here, but I can't figure it out...
Best regards,
Teo
Teodor Timplaru
<oXygen/> XML Editor
http://www.oxygenxml.com
Re: XPath version in XSLT 3.0? Weird regex error
Thank you so much for looking into this, Teo. I have since updated to macOS 15 and the latest oXygen, and I am, indeed, no longer having problems with `\p{P}`. But I'm still having issues with Unicode scripts and Unicode blocks in XPath, so I have created an oXygen project that demonstrates the issue: https://github.com/ttasovac/unicode-regex-in-oxygen.
You can view the rest of this post in the README on GitHub — it will be more readable.
Unicode assigns code points to blocks and scripts:
- The regular expression `\p{Cyrillic}+` should match characters that are assigned to the Cyrillic script. Some regex flavors require the `\p{IsCyrillic}+` notation. The Cyrillic _script_ should match all the Cyrillic _blocks_.
- `\p{InCyrillic}+`, `\p{InCyrillicExtended-A}`, `\p{InCyrillicExtended-B}` should each match the corresponding Cyrillic block.
## Script matching
### oXygen search
In oXygen Search, I'm getting the expected results, with `\p{IsCyrillic}+` matching Cyrillic script characters from all three Cyrillic blocks: Cyrillic, Cyrillic Extended-A and Cyrillic Extended-B.
**oXygen search works as expected.**
### XPath matching
In XSLT, `\p{IsCyrillic}+` will match Cyrillic characters only in the Cyrillic block, but will _not_ match those that are in Cyrillic Extended-A and Cyrillic Extended-B.
**XPath seems to match the block instead of the script.**
## Block matching
### oXygen search
Works as expected. Blocks are correctly matched.
### XPath matching
Block matching doesn't work at all because the character category is not recognized.
How can I make sense of this?
Re: XPath version in XSLT 3.0? Weird regex error
Hi Toma,
In XPath/XSLT 3.0, the matches() function uses Unicode-aware regular expressions in the XML Schema flavor. In that flavor the \p{Is...} notation selects a Unicode block, not a script, so \p{IsCyrillic} matches only the basic Cyrillic block (U+0400–U+04FF). The \p{In...} notation comes from other regex flavors (Java, for example), so \p{InCyrillicExtended-A} and \p{InCyrillicExtended-B} are not recognized in XPath.
To check characters from the Cyrillic Extended-A and Cyrillic Extended-B blocks, you will need to use explicit ranges of Unicode code points.
Cyrillic Extended-A has Unicode code points in the range: U+2DE0–U+2DFF.
Cyrillic Extended-B has Unicode code points in the range: U+A640–U+A69F.
Instead of using regular expressions, we use the string-to-codepoints() function to convert each character in the text into its corresponding Unicode code points. We then compare these code points to check if they fall within specific ranges:
Cyrillic Extended-A: Code points are between U+2DE0 (11744) and U+2DFF (11775).
Cyrillic Extended-B: Code points are between U+A640 (42560) and U+A69F (42655).
The proper XSLT for your needs is below:
Code: Select all
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Each predicate tests whether B contains at least one character from
         the block. The priorities are distinct so that a B with characters
         from several blocks is never matched ambiguously. -->

    <!-- Cyrillic block (U+0400–U+04FF) -->
    <xsl:template match="B[matches(., '\p{IsCyrillic}')]" priority="4">
        <B>
            <xsl:text>matched Cyrillic: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>

    <!-- Cyrillic Extended-A block (U+2DE0–U+2DFF, decimal 11744–11775).
         string-to-codepoints() returns a sequence, and ge/le compare single
         values only, so a quantified expression is used. -->
    <xsl:template match="B[some $c in string-to-codepoints(.) satisfies ($c ge 11744 and $c le 11775)]" priority="3">
        <B>
            <xsl:text>matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>

    <!-- Cyrillic Extended-B block (U+A640–U+A69F, decimal 42560–42655) -->
    <xsl:template match="B[some $c in string-to-codepoints(.) satisfies ($c ge 42560 and $c le 42655)]" priority="2">
        <B>
            <xsl:text>matched B: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>

    <!-- Fallback: B contains no character from any of the three blocks -->
    <xsl:template match="B" priority="1">
        <B>
            <xsl:text>not matched: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
</xsl:stylesheet>
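If you would rather stay with matches(), the same block tests can probably be written as regex character classes with explicit code point ranges. The following is only a sketch, not something tested against your project: it assumes the ranges are written as XML character references (which the XML parser resolves before the regex is compiled), and the priorities and output labels are just illustrative choices.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">

    <xsl:mode on-no-match="shallow-copy"/>

    <!-- At least one character from Cyrillic Extended-A (U+2DE0 to U+2DFF) -->
    <xsl:template match="B[matches(., '[&#x2DE0;-&#x2DFF;]')]" priority="3">
        <B>
            <xsl:text>matched A: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>

    <!-- At least one character from Cyrillic Extended-B (U+A640 to U+A69F) -->
    <xsl:template match="B[matches(., '[&#xA640;-&#xA69F;]')]" priority="2">
        <B>
            <xsl:text>matched B: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>

    <!-- B consists only of characters from the three blocks discussed here
         (basic Cyrillic, Extended-A, Extended-B). This approximates a
         script test; other blocks that hold Cyrillic characters are not
         covered. -->
    <xsl:template match="B[matches(., '^[&#x0400;-&#x04FF;&#x2DE0;-&#x2DFF;&#xA640;-&#xA69F;]+$')]" priority="1">
        <B>
            <xsl:text>all Cyrillic: </xsl:text>
            <xsl:value-of select="."/>
        </B>
    </xsl:template>
</xsl:stylesheet>
```

With this approach the hexadecimal code points from the Unicode charts can be used directly, so no conversion to decimal is needed.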
Mircea Enachescu
<oXygen> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com