Searching for a <li> tag that didn't close

Post by **BenDupre** » Thu May 11, 2023 7:37 pm

I tried this regex in the search box
<li>.*?</(?=ul>)
looking for an unclosed <li> which I seem to have somewhere in my data set.
The *? quantifier is acting greedy and grabbing everything up to the closing </ul> tag. Can anyone tell me why? Or how to fix?
THANKS

Post by **adrian** » Fri May 12, 2023 1:35 pm

Hi,

The pattern <li>.*?</(?=ul>) matches text that starts at the first "<li>" and ends at the first "</" that is followed by "ul>", with the "ul>" itself left out of the match. So it is not being greedy. A greedy quantifier takes the longest string that can match, extending to the right as far as possible (up to the last "</ul>" occurrence). The lazy .*? only makes the match end as early as possible; it does not move the starting point. The match therefore still begins at the first "<li>" and runs through every correctly closed item up to the first "</ul>", which is not what you seem to be looking for.
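
To see the difference between the lazy and the greedy form, here is a small sketch in Python. It assumes Python's re module behaves like the Oxygen search box for these constructs (it should, for lookaheads and lazy quantifiers), and the sample string is made up for illustration:

```python
import re

# Hypothetical sample: the second <li> is never closed, and a second list follows.
sample = "<ul><li>one</li><li>two<li>three</li></ul><ul><li>four</li></ul>"

# re.DOTALL plays the role of the "Dot matches all" option.
lazy = re.search(r"<li>.*?</(?=ul>)", sample, re.DOTALL)
greedy = re.search(r"<li>.*</(?=ul>)", sample, re.DOTALL)

# Both matches start at the first <li>; only the right-hand end differs.
print(lazy.group(0))    # stops at the first </ul>
print(greedy.group(0))  # runs on to the last </ul>
```

The lazy match is still long because it has to begin at the first <li>; it just stops at the first place where the rest of the pattern can succeed.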

Anyway, what you want is to skip correct <li>.*</li> pairs from the match. So, check the box for the option [x] Dot matches all and try:

<li>((?!</li>).)*(?=(</ul>|<li>))

Here's a breakdown of the regex:

<li>: matches the <li> tag
((?!</li>).)*: matches any single character, zero or more times, as long as a </li> tag does not start at that character. The negative lookahead (?!</li>) is checked before each character is consumed, so the match can never run across a </li>. This is what excludes the correct <li>.*</li> pairs.
(?=(</ul>|<li>)): is a zero-width positive lookahead that requires the match to be followed by either the </ul> end tag or a new <li> start tag, without including that tag in the match.

BTW, have you considered using Oxygen's XML validation (or Check Well-Formedness) operation to identify XML well-formedness errors?
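
As a rough illustration of what a well-formedness check gives you, here is a sketch using Python's standard XML parser. It is only a stand-in for the Oxygen operation, not how Oxygen implements it, and the function name first_wellformedness_error is made up:

```python
import xml.etree.ElementTree as ET

def first_wellformedness_error(xml_text):
    """Return (line, column, message) for the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # location where the parser detected the problem
        return line, column, str(err)
    return None

# Hypothetical sample: the second <li> is never closed.
broken = "<ul><li>one</li><li>two</ul>"
print(first_wellformedness_error(broken))
```

A parser like this reports the place where the mismatch is detected (the closing tag that does not fit), which is usually a little after the <li> that was left open.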

Regards,
Adrian
