OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]
RE: [xml-dev] XML diff tool / algorithm

This is an interesting use case.  I have not personally seen anything that can handle what you're asking 'ideally'
(although I've used several diff tools).
In your case you need a tradeoff between ideal representation and time and space.

I suggest maybe a compromise, depending on how your DB handles fragmentation.   A lot of real-world XML, especially the big ones, are really large lists or heterogeneous collections that can be efficiently split at the root+1 level nodes.
If you split documents at  this level (or some customizable level) then do a binary compare, or MD5 of each fragment (or however your nodes are represented) you may find an efficient algorithm which performs reasonably in time & space for a useful set of cases.
Now finding inserts/deletes may be harder. - might require multiple passes.  But this approach would turn the diff into a linear instead of hierarchical diff, which has many known/published algorithms.  Certainly not perfect but may be vastly better then nothing.

David A. Lee

-----Original Message-----
From: Johannes.Lichtenberger [mailto:Johannes.Lichtenberger@uni-konstanz.de] 
Sent: Tuesday, June 28, 2011 6:25 AM
To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] XML diff tool / algorithm

On 06/28/2011 11:49 AM, Michael Kay wrote:
> On 27/06/2011 23:42, Johannes.Lichtenberger wrote:
>> Hello,
>> does someone use an XML-diff tool, which can handle large XML instances
>> up to several GBs and more?
> DeltaXML specializes in this area.
> www.deltaxml.com

Hello Michael,

that's one of the things I've found very early, but I've forgotten to
meantion that I'm searching free software or just an algorithm which can
be used in conjunction with our open source XML database system.

It should be used for importing the data, to update the stored data to
new revisions, since updates can be handled efficiently based on the
used algorithm in the database system (Full, Incremental, Differential
and another one, which is going to be explained in a paper soon). For
the incremental and differential approach the found changes have to be
minimal or at least smaller than inserting a full dump.

I haven't found something which fulfills my criterias (minimal edit
scripts, which in our case means minimal updates to the storage, handle
large instances and is running in reasonable times (maybe only in the
worst case O(n^2)).



XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]

News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 1993-2007 XML.org. This site is hosted by OASIS