Miles Sabin writes:
> Judging from your examples it looks like you're after a closeness
> criterion derived from longest common subsequences. But I don't see how
> you could use that to usefully construct a single characteristic number
> for _any_ string of _any_ length: with only 32 or 64 bits to play with,
> many many completely unrelated (on any criterion) strings will collide
> on the same code.
Quite right. However, if the alternative is a linear search (say,
using an edit-distance algorithm), then reducing the number of
candidates by a few orders of magnitude would not necessarily be a bad
thing.
The problem I was considering (in the shower) was detecting spam
messages with minor variations, such as the insertion of the
recipient's e-mail address in the body or the substitution of Zimbabwe
for Nigeria. Assume that I have a database containing many millions
of known spam messages, and that I want to check an incoming e-mail
message against it. If I can narrow the field down to, say, 50
candidates after a very inexpensive operation, then my system will be
much more efficient; I can then use edit-distance against the closest
matches to see if the message really is likely spam.
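To make the two-stage idea concrete, here is a rough Python sketch.
It is not the algorithm in the zip linked below; the fingerprint based
on hashed word trigrams, the SpamIndex class, and its thresholds are
just assumptions I've made up for the illustration.

    from collections import defaultdict

    def fingerprint(text, k=3, buckets=64):
        """Map a message to a small set of bucket numbers derived from
        hashed word k-grams; near-duplicate messages share most buckets."""
        words = text.lower().split()
        shingles = [' '.join(words[i:i + k])
                    for i in range(len(words) - k + 1)]
        return {hash(s) % buckets for s in shingles}

    def edit_distance(a, b):
        """Plain dynamic-programming Levenshtein distance, word by word."""
        a, b = a.split(), b.split()
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            cur = [i]
            for j, wb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (wa != wb)))    # substitution
            prev = cur
        return prev[-1]

    class SpamIndex:
        """Toy index mapping fingerprint buckets to message ids,
        used for the inexpensive first pass."""
        def __init__(self):
            self.messages = []
            self.index = defaultdict(set)

        def add(self, text):
            msg_id = len(self.messages)
            self.messages.append(text)
            for b in fingerprint(text):
                self.index[b].add(msg_id)

        def candidates(self, text, min_shared=4):
            """Cheap pass: messages sharing at least min_shared buckets."""
            counts = defaultdict(int)
            for b in fingerprint(text):
                for msg_id in self.index[b]:
                    counts[msg_id] += 1
            return [m for m, c in counts.items() if c >= min_shared]

        def looks_like_spam(self, text, max_distance=10):
            """Expensive pass: edit distance against the few candidates."""
            return any(edit_distance(text, self.messages[m]) <= max_distance
                       for m in self.candidates(text))

The point is simply that the cheap first pass cuts millions of stored
messages down to a handful, so the comparatively expensive
edit-distance check only runs a few dozen times per incoming message.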
That said, based on private e-mail from another list member, I suspect
that there may be nothing original about the algorithm I came up with;
nevertheless, here it is for anyone who would care to take a peek:
http://www.megginson.com/private/megginson-index-00.zip
All the best,
David
--
David Megginson, david@megginson.com, http://www.megginson.com/