If you're concerned about files being byte-for-byte identical, hashing each
file is fine; if you're concerned about semantic identity (e.g., the order
of attributes doesn't matter), then use standard XML canonicalization
or something similar (but "something similar" won't be as good:)
Here's a portable Python script that compares all files named on
the command line:
; cat x.py
import sys, sha
from xml.dom.ext.reader import PyExpat
from xml.dom.ext.c14n import Canonicalize

hashes = {}
for f in sys.argv[1:]:   # skip argv[0], the script name itself
    o = sha.sha()
    if 1:
        # simple hash of the raw bytes
        o.update(open(f).read())
    else:
        # sha(c14n(doc)) -- hash the canonicalized form instead
        r = PyExpat.Reader()
        dom = r.fromStream(open(f))
        o.update(Canonicalize(dom))
    h = o.digest()
    other = hashes.get(h, None)
    if other:
        print 'duplicate', f, other
    else:
        hashes[h] = f
;
--
Rich Salz Chief Security Architect
DataPower Technology http://www.datapower.com
XS40 XML Security Gateway http://www.datapower.com/products/xs40.html
XML Security Overview http://www.datapower.com/xmldev/xmlsecurity.html