odd regex behavior

Posted: Wed Apr 09, 2025 3:15 pm
by RIH
I'm running XQuery 3.1 with oXygen 26.1 / Saxon EE 12.3 and getting some surprising behavior when using regular expressions. It seems that the regex engine is treating optional capturing groups as non-capturing groups? Am I missing something?

To reproduce: in the XPath dialog, the XPath/XQuery Builder dialog, or an XQuery script, with the XPath/XQuery version set to 3.1 in each case, run

Code:

replace("abc1234-5678abcd", "(.+)(\d{4})(\-\d{4})(.*)", "$2", "i")
which returns the expected "1234". But when you make one of the groups optional and try to return it, e.g.

Code:

replace("abc1234-5678abcd", "(.+)(\d{4})?(\-\d{4})(.*)", "$2", "i")
it returns an empty string. I expected it to return "1234", since a match is present. It strikes me as a bug, but my apologies if I'm just thinking about the regex wrong!

Re: odd regex behavior

Posted: Thu Apr 10, 2025 10:04 am
by teo
Hello,

It seems you can get a detailed explanation of the reported case by asking an AI assistant.
Copy the message you posted above and paste it into the ChatGPT window, for example.
The response also includes a proposed fix/workaround.

Additional note: the response received from ChatGPT is nicely formatted as HTML and easy to read and understand.
It would have lost some of its clarity if I had posted it directly here.

Regards,
Teo

Re: odd regex behavior

Posted: Thu Apr 10, 2025 2:45 pm
by RIH
Thank you. The AI says to use a lookahead. I understand that's a possibility. Still, it seems something has changed in the regex engine if it no longer captures an optional capturing group when the pattern is present?

Update for the possible benefit of others: Answering my own question with the excellent guidance on regular-expressions.info:

oXygen does support backreferences to non-participating capturing groups (in fact, with or without the Saxon 'j' flag per my experimentation).

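As to my problem, I needed to make the quantifier in the non-optional capturing group 1 lazy, i.e. (.+?) instead of (.+). Otherwise the greedy group 1 consumes the digits I expected optional capturing group 2 to match, and because group 2 is optional the overall match still succeeds with group 2 empty.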
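
A minimal sketch of the difference, reusing the input and pattern from the first post; the only change in the second call is `+` to `+?` in group 1. It is meant as an illustration of greedy versus lazy matching, not as the only way to write the pattern:

```xquery
(: Greedy group 1: (.+) backs off only as far as needed for (\-\d{4}) to match,
   so it keeps "abc1234" and optional group 2 is left empty. Returns "" :)
replace("abc1234-5678abcd", "(.+)(\d{4})?(\-\d{4})(.*)", "$2", "i"),

(: Lazy group 1: (.+?) grows one character at a time and stops at "abc",
   leaving "1234" for optional group 2. Returns "1234" :)
replace("abc1234-5678abcd", "(.+?)(\d{4})?(\-\d{4})(.*)", "$2", "i")
```

Evaluated as a single expression, this yields a sequence of two strings, so both results can be compared in one run of the XPath/XQuery Builder.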

Thanks again for your assistance.