XML Diff: Example of non-optimal character compare

dma_k
Posts: 32
Joined: Fri Aug 05, 2011 8:27 pm

Post by dma_k »

In the example shown, the comparison of this line is not optimal:

The letter "f" is matched against the "f" in href, so the matched sequence is fragmented. I would expect the result to be less fragmented.

Image

Text on the left:

<?xml-stylesheet type="text/xsl" href="example.xsl"?>
<?oxygen NVDLSchema="example.nvdl"?>
<fulltext-document xmlns="http://www.epo.org/fulltext" system="ops.epo.org"

Text on the right:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><fulltext-document xmlns="http://www.epo.org/fulltext" system="ops.epo.org"
adrian
Posts: 2879
Joined: Tue May 17, 2005 4:01 pm

Post by adrian »

Hi,

Please note that you are using the Characters diff algorithm, which tries to match anything at character level and is not XML-aware (this algorithm is pretty much useless when comparing XML documents). In your case it found an "f" on both sides, so it considered them identical.

Try a more advanced diff algorithm, at least Words or Lines, or better yet leave it on Auto, which automatically detects the appropriate algorithm (XML Fast or XML Accurate) for the content type of each file.
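
To illustrate the character-level behaviour described above, here is a minimal sketch using Python's standard difflib module. It is only an approximation: difflib is not the algorithm used by XML Diff, and the helper names matched_runs and tokens are made up for this example. It compares a fragment from each side of the example, first character by character and then token by token.

```python
import difflib
import re

def matched_runs(left, right):
    """Return the pieces that a generic sequence diff treats as unchanged."""
    matcher = difflib.SequenceMatcher(None, left, right, autojunk=False)
    return ["".join(left[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks() if m.size]

def tokens(text):
    # Hypothetical tokenizer: runs of word characters, or single other characters.
    return re.findall(r"\w+|\W", text)

left = 'href="example.xsl"'
right = '<fulltext-document'

# Character level: isolated letters shared by unrelated words count as "equal".
print(matched_runs("href", "full"))
print(matched_runs(left, right))

# Token level: whole tokens must match, so these unrelated fragments share nothing.
print(matched_runs(tokens(left), tokens(right)))
```

With whole tokens as the unit of comparison, a stray "f" in href can no longer be paired with the "f" in fulltext-document, which is why the coarser algorithms give a less fragmented result here.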

Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
dma_k
Posts: 32
Joined: Fri Aug 05, 2011 8:27 pm

Post by dma_k »

Adrian,

Thanks for your reply. I am attaching the test set (test.zip), which I also used in another post.

Indeed, I am not using the full power of the XML comparison in XML Diff, and that is because of the specifics of the files I am working on. When I switch to "Words" or "Lines", the comparison (at least visually) becomes worse: e.g. the algorithm thinks that " (1998) " should be replaced by " (<date>1986</date>) " (in reality this change is split into two). Of course that is correct, but it would be ideal if it showed that "1998" was replaced by "<date>1986</date>", which is a more compact and thus more efficient diff. So ideally the document US2004248206A1_1.xml should have no changes marked as removed or replaced, as all information was only inserted into it. Perhaps the word comparison can be improved?

I also think that with line comparison, a "block of changes" should be a single line (relative to this). Currently the algorithm joins all consecutive changed lines, which may end up as a very large change block. Perhaps there could be a preference option that defines the granularity of the block.
adrian
Posts: 2879
Joined: Tue May 17, 2005 4:01 pm

Post by adrian »

Hi,

I believe you mean that " (1998) " should be replaced by " (<date>1998</date>) ". I don't see the year being changed (1998 -> 1986) in your documents.
But anyway, given that the actual content doesn't change and remains 1998, I consider the result of the algorithm (which is unaware of XML) to be correct. It shows that <date> was inserted before 1998 and </date> was inserted after it: (<date>1998</date>).
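
The difference between the two granularities can be sketched with Python's standard difflib module. This is only an illustration under stated assumptions: difflib is not the XML Diff implementation, the changes helper is hypothetical, and splitting words on whitespace is an assumption about tokenization that may not match the Words algorithm.

```python
import difflib

def changes(left, right, sep=""):
    """List the non-equal opcodes of a plain sequence diff as (tag, old, new)."""
    matcher = difflib.SequenceMatcher(None, left, right, autojunk=False)
    return [
        (tag, sep.join(left[i1:i2]), sep.join(right[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

old = " (1998) "
new = " (<date>1998</date>) "

# Character level: the year is unchanged, so only the two tags are reported.
char_level = changes(old, new)
print(char_level)

# Whitespace-delimited words (assumed tokenization): the whole word differs.
word_level = changes(old.split(), new.split(), sep=" ")
print(word_level)
```

At character level the sketch reports two separate insertions around an unchanged 1998, while at word level the parenthesised word is reported as one replacement.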

If you are arguing that both date tags should be considered a single change, then you are correct XML-wise, but then again the Characters algorithm is not XML-aware. On the other hand, the XML-aware algorithms (XML Fast/Accurate) only see one large change, because they see a text node on one side and XML markup on the other.


Your example is an extreme scenario: you are comparing an XML document that has markup with an XML document that has one large text node. This is the equivalent of comparing an XML document with a text document.
The algorithms provided by XML Diff are either text-centric (Characters, Words, Lines) or XML-centric (XML Fast, XML Accurate); there is no algorithm that can handle both accurately at the same time.
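
To show why a tree-based comparison sees such content so differently, here is a small sketch using Python's standard xml.etree.ElementTree. The element name p, the sample sentence, and the shape helper are invented for illustration and do not come from the attached files.

```python
import xml.etree.ElementTree as ET

def shape(xml_text):
    """Summarise an element as (own text, [(child tag, child text, tail), ...])."""
    root = ET.fromstring(xml_text)
    return root.text, [(child.tag, child.text, child.tail) for child in root]

# One side: a single text node with no child elements.
plain = shape("<p>filed (1998) in the US</p>")

# Other side: the same characters, but split around a <date> child element.
marked = shape("<p>filed (<date>1998</date>) in the US</p>")

print(plain)
print(marked)
```

The characters are the same on both sides, yet one tree has a single text node and the other has leading text, a child element, and tail text, so a node-oriented comparison has little to align.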

The block of changes should probably be at line level when comparing with the Lines algorithm. There is an option for merging adjacent differences (disabled by default), but no option controlling the merging of blocks of changes, which seem to be merged by default. I'll log this in our issue tracking tool so that an option can be added.
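
As a rough sketch of the two granularities being discussed, the following uses Python's standard difflib module. It is not how XML Diff works internally; the line_blocks helper and its per_line flag are hypothetical stand-ins for the kind of option requested.

```python
import difflib
from itertools import zip_longest

def line_blocks(left, right, per_line=False):
    """Return change blocks as (old_lines, new_lines) tuples."""
    matcher = difflib.SequenceMatcher(None, left, right, autojunk=False)
    blocks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = left[i1:i2], right[j1:j2]
        if per_line:
            # Pair lines positionally; a missing partner becomes an empty list.
            for o, n in zip_longest(old, new):
                blocks.append(([] if o is None else [o],
                               [] if n is None else [n]))
        else:
            # Default: consecutive changed lines form one merged block.
            blocks.append((old, new))
    return blocks

left = ["a", "b", "c", "d"]
right = ["a", "B", "C", "d"]

merged = line_blocks(left, right)
split = line_blocks(left, right, per_line=True)
print(merged)
print(split)
```

With the default setting the two changed lines form one block; with per_line=True each changed line becomes its own block.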

Regards,
Adrian