OASIS Mailing List Archives



   Re: [xml-dev] hashing


If you're concerned about byte-for-byte identity, hashing each file's
raw contents is okay; if you're concerned about semantic identity (e.g.,
the order of attributes doesn't matter), then use standard XML
canonicalization or something similar (but it won't be as good :)
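
To make the distinction concrete, here is a minimal sketch in current
Python 3 (3.8 or later). It is an illustration under assumptions, not the
script from this post: the two sample documents are made up, SHA-256 stands
in for SHA-1, and the standard library's canonicalize() implements C14N 2.0
rather than the older Canonical XML 1.0.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # Python 3.8+, C14N 2.0

# Two hypothetical documents that differ only in attribute order and
# in how the empty element is written.
doc_a = '<item id="1" name="x"/>'
doc_b = '<item name="x" id="1"></item>'

def raw_hash(xml_text):
    """SHA-256 of the text exactly as written (byte-for-byte identity)."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()

def canonical_hash(xml_text):
    """SHA-256 of the canonical form (attribute order no longer matters)."""
    return hashlib.sha256(canonicalize(xml_text).encode("utf-8")).hexdigest()

print(raw_hash(doc_a) == raw_hash(doc_b))              # False
print(canonical_hash(doc_a) == canonical_hash(doc_b))  # True
```

Both documents canonicalize to the same string, so only the second
comparison treats them as equal.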

Here's a portable Python script that compares all files named on
the command line:

; cat x.py
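import sys,sha
from xml.dom.ext.reader import PyExpat
from xml.dom.ext.c14n import Canonicalize

hashes = {}
for f in sys.argv[1:]:          # skip the script name itself
    o = sha.sha()
    if 1:
        # simple hash of contents
        o.update(open(f, 'rb').read())
    else:
        # sha(c14n(doc))
        r = PyExpat.Reader()
        dom = r.fromStream(open(f))
        # assumes Canonicalize() returns the canonical form as a
        # string when no output stream is given
        o.update(Canonicalize(dom))
    h = o.digest()
    other = hashes.get(h, None)
    if other:
        print 'duplicate', f, other
    else:
        # first file seen with this digest
        hashes[h] = f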
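
For anyone without PyXML installed, roughly the same idea can be sketched
with only the Python 3 standard library (3.8 or later). This is an
assumption-laden rework rather than the script above: file_digest and
find_duplicates are names invented here, SHA-256 replaces SHA-1, and
canonicalize() produces C14N 2.0 output.

```python
import hashlib
import sys
from xml.etree.ElementTree import canonicalize  # Python 3.8+, C14N 2.0

def file_digest(path, canonical=False):
    """Return a SHA-256 digest of a file, raw or canonicalized."""
    if canonical:
        # Parse the file and hash its canonical serialization.
        data = canonicalize(from_file=path).encode("utf-8")
    else:
        # Hash the bytes exactly as stored on disk.
        with open(path, "rb") as fh:
            data = fh.read()
    return hashlib.sha256(data).digest()

def find_duplicates(paths, canonical=False):
    """Return (path, earlier_path) pairs for files with equal digests."""
    seen = {}
    duplicates = []
    for path in paths:
        h = file_digest(path, canonical)
        if h in seen:
            duplicates.append((path, seen[h]))
        else:
            seen[h] = path
    return duplicates

if __name__ == "__main__":
    for path, other in find_duplicates(sys.argv[1:], canonical=True):
        print("duplicate", path, other)
```

Pass canonical=False to fall back to the byte-for-byte comparison.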

Rich Salz                  Chief Security Architect
DataPower Technology       http://www.datapower.com
XS40 XML Security Gateway  http://www.datapower.com/products/xs40.html
XML Security Overview      http://www.datapower.com/xmldev/xmlsecurity.html


