Oxygen XML Forum

Posted: **Fri Apr 21, 2017 11:42 am**

Dear all,

I have a file in which I need to replace some information with tags, using a regex in the Find/Replace dialog. However, I am not able to get the regex right.

The text is like this.

Code:

 <sense>pron. rel. et conj. rel. (gramm. § 147; gramm. § 169,5).

                        A) Pron. rel.: Sing. m. ዘ፡, fem. እንተ፡, Pl. c. እለ፡ <i>qui</i>, <i>quae</i>,

                            <i>quod</i>. 1) De constructione hujus pronominis B) Sx. Sen. 7 Enc.‎

                        </sense>

I need to match everything that follows A) and everything that follows B). In regexer I found that this expression works fine:

((\s)([A-Z])(\))(\s))(.*?)(?=((\s)([A-Z])(\)(\s))|$))

the first match is

Code:

A) Pron. rel.: Sing. m. ዘ፡, fem. እንተ፡, Pl. c. እለ፡ <i>qui</i>, <i>quae</i>,

                            <i>quod</i>. 1) De constructione hujus pronominis

the second

Code:

B) Sx. Sen. 7 Enc.‎

I tried to convert it to the required regex dialect for oXygen, but without success.
I have changed [A-Z] to \p{Upper}, as indicated in another forum post, added (?s) so that the dot also matches line breaks, and restricted the path to the sense element.

((\s)(\p{Upper})(\))(\s))((?s).*?)(?=((\s)(\p{Upper})(\)(\s))|$))

Nevertheless, this does not work, and I get only the following as the first match

Code:

A) Pron. rel.: Sing. m. ዘ፡, fem. እንተ፡, Pl. c. እለ፡ <i>qui</i>, <i>quae</i>,

and second match

Code:

B) Sx. Sen. 7 Enc.‎

I do not understand where my mistake is. I have tried changing the greediness, without success:

((\s)(\p{Upper})(\))(\s))((?s).*)(?=((\s)(\p{Upper})(\)(\s))|$))

or

((\s)(\p{Upper})(\))(\s))(.*?)(?=((\s)(\p{Upper})(\)(\s))|$))

Both return only one match:

Code:

 A) Pron. rel.: Sing. m. ዘ፡, fem. እንተ፡, Pl. c. እለ፡ <i>qui</i>, <i>quae</i>,

                            <i>quod</i>. 1) De constructione hujus pronominis B) Sx. Sen. 7 Enc.‎

Thanks a lot for any advice or help on how to make this work!

Posted: **Fri Apr 21, 2017 3:30 pm**

Hi,

I started with your initial regexp and cleaned up the redundant parentheses (kept just a few):

Code:

\s[A-Z]\)\s(.*?)(?=(\s[A-Z]\)\s)|$)

First thing to note is that in Oxygen '.' (dot) matches any character except line terminators. You can make it match everything, by checking the option "Dot matches all", or you can add at the beginning of your expression the flag (?s).

The problem however is the '$' (dollar) in the expression. Some regexp engines interpret $ as EOF, others as EOL. Oxygen is in the latter category (EOL).
Since your match is lazy (.*?) it stops after finding the shortest match (ending in EOL).
I don't have a proper solution for this one. The problem is you want your first match to span across multiple lines, ignoring the line terminator, but want your second match to end at the line terminator. As far as I can tell, you can't have it both ways. With a lazy match either all matches end at line terminators or none do (I picked the latter).
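
To illustrate the EOL/EOF distinction outside of Oxygen, here is a minimal sketch in Python (so not Oxygen's own engine). It assumes that `re.MULTILINE` is a fair stand-in for the EOL reading of `$` and `re.DOTALL` for "Dot matches all"; the sample text is a shortened, hypothetical version of the one in the question.

```python
import re

# Shortened sample modelled on the <sense> element from the question.
TEXT = (
    "<sense>pron. rel. et conj. rel.\n"
    "    A) Pron. rel.: <i>qui</i>, <i>quae</i>,\n"
    "        <i>quod</i>. 1) De constructione B) Sx. Sen. 7 Enc.\n"
    "</sense>"
)

# The cleaned-up expression from this reply, unchanged.
PATTERN = r"\s[A-Z]\)\s(.*?)(?=(\s[A-Z]\)\s)|$)"

# Assumption: MULTILINE mimics "$ means end of line"; DOTALL mimics "Dot matches all".
eol = [m.group(1) for m in re.finditer(PATTERN, TEXT, re.DOTALL | re.MULTILINE)]

# Without MULTILINE, Python treats $ as (roughly) end of input instead.
eof = [m.group(1) for m in re.finditer(PATTERN, TEXT, re.DOTALL)]

for label, matches in (("$ as EOL", eol), ("$ as EOF", eof)):
    print(label)
    for text in matches:
        print("   ", repr(text))
```

With `$` as EOL the lazy `.*?` stops at the first line end, so the A) match is cut off after `<i>quae</i>,`. With `$` as EOF the A) match runs up to B), but the B) match continues to the very end of the text.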

What you can do is use \z (end of input) instead of $ (EOL) and you get this:

Code:

(?s)\s[A-Z]\)\s(.*?)((?=(\s[A-Z]\)\s)|\z))

This finds the first match as you expect, but the second spans all the way to the end of the file (includes the end tag).
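
This behaviour can be reproduced in Python as a rough sketch. Note the assumptions: Python's `re` module writes end of input as `\Z` where Oxygen uses `\z`, the sample text is a shortened version of the one in the question, and the second pattern, which also stops in front of `</sense>`, is a hypothetical variation that is not from this thread and has not been tested in Oxygen.

```python
import re

# Shortened sample modelled on the <sense> element from the question.
TEXT = (
    "<sense>pron. rel. et conj. rel.\n"
    "    A) Pron. rel.: <i>qui</i>, <i>quae</i>,\n"
    "        <i>quod</i>. 1) De constructione B) Sx. Sen. 7 Enc.\n"
    "</sense>"
)

# The expression above, with \z written as \Z for Python.
END_OF_INPUT = r"(?s)\s[A-Z]\)\s(.*?)((?=(\s[A-Z]\)\s)|\Z))"

# Hypothetical variation: additionally stop before the closing </sense> tag.
END_OF_SENSE = r"(?s)\s[A-Z]\)\s(.*?)(?=\s[A-Z]\)\s|\s*</sense>|\Z)"

# MULTILINE is switched on to show that, unlike $, \Z is not affected by it.
to_end = [m.group(1) for m in re.finditer(END_OF_INPUT, TEXT, re.MULTILINE)]
to_tag = [m.group(1) for m in re.finditer(END_OF_SENSE, TEXT, re.MULTILINE)]

print(to_end)  # second item still contains the </sense> end tag
print(to_tag)  # second item stops before </sense>
```

If the variation carries over to Oxygen, it would only help where every block to be matched sits directly inside a `sense` element; nested or differently named elements would need their own alternative in the lookahead.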

Regards,
Adrian

Posted: **Fri Apr 21, 2017 5:17 pm**

Thank you very much for the clarification! I will find a workaround; your answer already helps a lot!

regex in find replace