md5sum is a command-line tool that computes a cryptographic hash of a file's
contents using the MD5 algorithm. It has to read every byte of every file, so
it's not fast, but it will do what you want. It's available on Linux, in
Cygwin, and probably other ways.
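For reference, md5sum prints the 32-character hex digest followed by the file
name, one file per line. The digest shown below is the well-known MD5 of empty
input (empty.xml here is a hypothetical zero-byte file):

    $ md5sum empty.xml
    d41d8cd98f00b204e9800998ecf8427e  empty.xml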
In a reasonable command shell, where Unix commands are available along with
md5sum,

    md5sum *.xml | sort

will sort the output by digest, so files with identical contents land on
neighboring lines.
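If GNU coreutils is on hand (an assumption; BSD userlands differ), uniq can
take this one step further and print only the duplicate groups, comparing
just the 32-character digest at the start of each line:

    md5sum *.xml | sort | uniq -w32 --all-repeated=separate

Each blank-line-separated group then lists the files that share a digest,
i.e. the byte-identical documents. Note this catches exact duplicates only;
two documents that differ in whitespace or attribute order hash differently.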
Jeff
----- Original Message -----
From: "Eric Hanson" <eric@aquameta.com>
To: <xml-dev@lists.xml.org>
Sent: Thursday, April 29, 2004 12:58 PM
Subject: [xml-dev] hashing
> I have a large collection of XML documents, and want to find and
> group any duplicates. The obvious but slow way of doing this is
> to just compare them all to each other. Is there a better
> approach?
>
> Particularly, are there any APIs or standards for "hashing" a
> document so that duplicates could be identified in a similar way
> to what you'd do with a hash table?