David Megginson wrote,
> I'm looking for references to a specific kind of text algorithm --
> the algorithm should generate a number (say, 32 or 64 bits) for any
> text string of any length, similar to a hash. However, it should be
> possible to compare the numbers for different strings to tell how
> close they are to each other. For example, the numbers for
> 1. To be or not to be.
> 2. Two bees or not two bees.
> 3. I don't know whether to be or not to be.
> should indicate that three strings are relatively close to each other
> (while a hash number would give no indication at all).
Umm ... define "close".
Judging from your examples, it looks like you're after a closeness
criterion derived from longest common subsequences. But I don't see how
you could use that to usefully construct a single characteristic number
for _any_ string of _any_ length: with only 32 or 64 bits to play with
there are at most 2^32 or 2^64 distinct codes, so many, many completely
unrelated (on any criterion) strings will collide on the same code.
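
For what it's worth, here is a rough Python sketch of what a pairwise
closeness measure based on longest common subsequences might look like.
It is only a guess at the criterion you have in mind: the names
lcs_length and closeness, and the choice to normalise by the longer
string's length, are my own assumptions. Note that it compares two
strings directly and does not produce a per-string number.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programme, keeping only the previous row.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either string, keep the better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def closeness(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: LCS length over the longer length.

    The normalisation is an assumption made for illustration only.
    """
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    texts = [
        "To be or not to be.",
        "Two bees or not two bees.",
        "I don't know whether to be or not to be.",
    ]
    # Print the score for every pair of the example strings.
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            print(f"{i + 1} vs {j + 1}: {closeness(texts[i], texts[j]):.2f}")
```

The score is a property of a pair of strings, which is exactly why I
don't see how to squeeze it into one fixed-width number per string.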