Searching for a <li> tag that didn't close

Posted: Thu May 11, 2023 7:37 pm
by BenDupre
I tried this regex in the search box
<li>.*?</(?=ul>)
looking for an unclosed <li> which I seem to have somewhere in my data set.
The *? quantifier seems to be acting greedy, grabbing everything up to the closing </ul> tag. Can anyone tell me why, or how to fix it?
THANKS

Re: Searching for a <li> tag that didn't close

Posted: Fri May 12, 2023 1:35 pm
by adrian
Hi,

This: <li>.*?</(?=ul>) matches text that starts with "<li>" and runs up to the first "</ul>" that follows, leaving the "ul>" string out of the match. So it's not greedy. Greedy means choosing the longest string that can match, expanding to the right as far as possible (up to the last "</ul>" occurrence). What this one actually finds starts at the first "<li>" and ends at the first "</ul>", because a lazy quantifier only keeps the right end of the match short; it does not move the start of the match to a later "<li>". That is not what you seem to be looking for.
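
To make the difference concrete, here is a small sketch using Python's re module as a stand-in (an assumption on my part: the Oxygen search box uses its own regex engine, though lazy and greedy quantifiers behave the same way in both). The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: the second <li> in the first list is never closed.
s = "<ul><li>a</li><li>b</ul><ul><li>c</li></ul>"

# Lazy version (the regex from the question): stops at the FIRST "</ul>".
lazy = re.search(r"<li>.*?</(?=ul>)", s).group(0)

# Greedy version: runs on to the LAST "</ul>".
greedy = re.search(r"<li>.*</(?=ul>)", s).group(0)

print(lazy)    # <li>a</li><li>b</
print(greedy)  # <li>a</li><li>b</ul><ul><li>c</li></
```

Both matches begin at the first "<li>", which is a correctly closed item. The lazy quantifier only changes where the match ends, not where it starts.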

Anyway, what you want is to exclude correct <li>.*</li> pairs from the match. So, check the box for the option [x] Dot matches all and try:

<li>((?!</li>).)*(?=(</ul>|<li>))
Here's a breakdown of the regex:
  • <li>: matches the <li> tag
  • ((?!</li>).)*: matches any character that is not the start of a </li> tag, zero or more times. The negative lookahead (?!</li>) ensures that the </li> tag does not immediately follow the current position. This is what excludes the correct <li>.*</li> pairs.
  • (?=(</ul>|<li>)): a zero-width positive lookahead that requires the match to be followed by either the </ul> end tag or a new <li> start tag, without including that tag in the match.
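
If you want to try the pattern outside the editor, here is a minimal sketch in Python. It assumes Python's re module behaves like the Oxygen search for this pattern, with re.DOTALL standing in for the "Dot matches all" option. The function name find_unclosed_li and the sample data are hypothetical.

```python
import re

# Same pattern as above; re.DOTALL plays the role of "Dot matches all".
UNCLOSED_LI = re.compile(r"<li>((?!</li>).)*(?=(</ul>|<li>))", re.DOTALL)

def find_unclosed_li(text):
    """Return the text of every <li> that reaches another <li> or </ul>
    before its own </li>."""
    # group(0) is the whole match; the capture groups are not needed here.
    return [m.group(0) for m in UNCLOSED_LI.finditer(text)]

# Hypothetical sample: the item "two" is missing its </li>.
sample = "<ul>\n<li>one</li>\n<li>two\n<li>three</li>\n</ul>"

print(find_unclosed_li(sample))  # ['<li>two\n']
```

Note that nested lists would be reported as well, since an outer <li> is followed by an inner <li> before its own </li>, so treat the hits as candidates to inspect rather than confirmed errors.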
BTW, have you considered using Oxygen's XML validation (or Check Well-Formedness) operation to identify XML well-formedness errors?

Regards,
Adrian