I don't know of a numeric approach to your problem, but it sounds
a lot like matching DNA strings while accounting for frameshift errors.
Dan Gusfield's book, Algorithms on Strings, Trees, and Sequences, devotes
a lot of space to that and similar problems.
- Mitch
David Megginson wrote:
> I'm looking for references to a specific kind of text algorithm -- the
> algorithm should generate a number (say, 32 or 64 bits) for any text
> string of any length, similar to a hash. However, it should be
> possible to compare the numbers for different strings to tell how
> close they are to each other. For example, the numbers for
>
> 1. To be or not to be.
>
> 2. Two bees or not two bees.
>
> 3. I don't know whether to be or not to be.
>
> should indicate that the three strings are relatively close to each other
> (while a hash number would give no indication at all).
>
> I'm asking only out of interest, because I came up with a simple
> algorithm to do this while I was in the shower yesterday, and it would
> be fun to see how close it is to what the pros use for spam detection
> and so on.
>
> Note that I'm not looking for algorithms based on edit-distance,
> bag-of-words, and so on.
>
>
> Thanks in advance,
>
>
> David
>
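For illustration only, here is a minimal sketch of one well-known way to get a fixed-width number whose bit pattern tends to stay similar for similar strings: a SimHash-style fingerprint. This is an assumption about what such an algorithm could look like, not David's algorithm and not something taken from Gusfield's book. The choice of character trigrams as features, MD5 as the per-feature hash, and the function names (shingles, simhash, hamming) are all hypothetical choices made for the example. Closeness is measured as the number of differing bits between two fingerprints.

```python
import hashlib


def shingles(text, n=3):
    """Split text into overlapping character n-grams (assumed feature choice)."""
    # Normalize case and whitespace so trivial differences do not matter.
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        return [cleaned]
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def simhash(text):
    """Return a 64-bit SimHash-style fingerprint of text."""
    bits = 64
    counts = [0] * bits
    for s in shingles(text):
        # Take the first 8 bytes of MD5 as a 64-bit hash of the feature.
        h = int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    texts = [
        "To be or not to be.",
        "Two bees or not two bees.",
        "I don't know whether to be or not to be.",
    ]
    prints = [simhash(t) for t in texts]
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            print(i + 1, j + 1, hamming(prints[i], prints[j]))
```

A smaller bit difference suggests more shared trigrams, but short strings give noisy results, so treat the distances as a rough signal rather than a precise measure.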