OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help



   Re: [xml-dev] [OT] Looking for a text algorithm

[ Lists Home | Date Index | Thread Index ]

David Megginson wrote,
> The problem I was considering (in the shower) was detecting spam
> messages with minor variations, such as the insertion of the
> recipient's e-mail address in the body or the substitution of
> Zimbabwe for Nigeria.  Assume that I have a database containing many
> millions of known spam messages, and that I want to check an incoming
> e-mail message against it.  If I can narrow the field down to, say,
> 50 candidates after a very inexpensive operation, then my system will
> be much more efficient; I can then use edit-distance against the
> closest matches to see if the message really is likely spam.

OK, that makes more sense ... but it's a different problem from the one 
suggested by your original question. Rather than generic closeness 
you're after a witness to closeness relative to a given corpus.

> That said, based on private e-mail from another list member, I suspect
> that there may be nothing original about the algorithm I came up with;
> nevertheless, here it is for anyone who would care to take a peek:
>  http://www.megginson.com/private/megginson-index-00.zip

Interesting. Ignoring the thresholding, I think your pre-test is roughly 
equivalent to edit distance over the strings considered as unordered 
sets of characters. A small string edit distance will imply a small 
unordered character set edit distance, so I guess the latter could be a 
useful witness to the former.




News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 2001 XML.org. This site is hosted by OASIS