Lists Home |
Date Index |
David Megginson wrote,
> The problem I was considering (in the shower) was detecting spam
> messages with minor variations, such as the insertion of the
> recipient's e-mail address in the body or the substitution of
> Zimbabwe for Nigeria. Assume that I have a database containing many
> millions of known spam messages, and that I want to check an incoming
> e-mail message against it. If I can narrow the field down to, say,
> 50 candidates after a very inexpensive operation, then my system will
> be much more efficient; I can then use edit-distance against the
> closest matches to see if the message really is likely spam.
OK, that makes more sense ... but it's a different problem from the one
suggested by your original question. Rather than generic closeness
you're after a witness to closeness relative to a given corpus.
> That said, based on private e-mail from another list member, I suspect
> that there may be nothing original about the algorithm I came up with;
> nevertheless, here it is for anyone who would care to take a peek:
Interesting. Ignoring the thresholding, I think your pre-test is roughly
equivalent to edit distance over the strings considered as unordered
sets of characters. A small string edit distance will imply a small
unordered character set edit distance, so I guess the latter could be a
useful witness to the former.