md5sum is a command-line tool that computes a cryptographic hash of a file's
contents using the MD5 algorithm. It has to read every byte of every file, so
it's not fast, but it will do what you want. It's available on Linux, in
Cygwin, and probably other ways.
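For reference, md5sum prints the 32-character hex digest followed by the file
name, one file per line. The digest shown below is the well-known MD5 of empty
input (empty.xml here is a hypothetical zero-byte file):

    $ md5sum empty.xml
    d41d8cd98f00b204e9800998ecf8427e  empty.xml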
In a reasonable command shell, where Unix commands are available along with
md5sum,

    md5sum *.xml | sort

will sort the output by digest, so files with identical contents land on
neighboring lines.
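If GNU coreutils is on hand (an assumption; BSD userlands differ), uniq can
take this one step further and print only the duplicate groups, comparing
just the 32-character digest at the start of each line:

    md5sum *.xml | sort | uniq -w32 --all-repeated=separate

Each blank-line-separated group then lists the files that share a digest,
i.e. the byte-identical documents. Note this catches exact duplicates only;
two documents that differ in whitespace or attribute order hash differently.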
Jeff
----- Original Message -----
From: "Eric Hanson" <eric@aquameta.com>
To: <xml-dev@lists.xml.org>
Sent: Thursday, April 29, 2004 12:58 PM
Subject: [xml-dev] hashing
> I have a large collection of XML documents, and want to find and
> group any duplicates. The obvious but slow way of doing this is
> to just compare them all to each other. Is there a better
> approach?
>
> Particularly, are there any APIs or standards for "hashing" a
> document so that duplicates could be identified in a similar way
> to what you'd do with a hash table?