xml-dev - Re: [xml-dev] [OT] Looking for a text algorithm

Re: [xml-dev] [OT] Looking for a text algorithm

[ Lists Home | Date Index | Thread Index ]

To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] [OT] Looking for a text algorithm
From: Miles Sabin <miles@milessabin.com>
Date: Sun, 9 Mar 2003 15:16:14 +0000
In-reply-to: <15979.19755.44779.643100@megginson.com>
References: <15977.64423.65785.440495@megginson.com> <200303091309.47125.miles@milessabin.com> <15979.19755.44779.643100@megginson.com>

David Megginson wrote,
> The problem I was considering (in the shower) was detecting spam
> messages with minor variations, such as the insertion of the
> recipient's e-mail address in the body or the substitution of
> Zimbabwe for Nigeria.  Assume that I have a database containing many
> millions of known spam messages, and that I want to check an incoming
> e-mail message against it.  If I can narrow the field down to, say,
> 50 candidates after a very inexpensive operation, then my system will
> be much more efficient; I can then use edit-distance against the
> closest matches to see if the message really is likely spam.

OK, that makes more sense ... but it's a different problem from the one 
suggested by your original question. Rather than generic closeness 
you're after a witness to closeness relative to a given corpus.

> That said, based on private e-mail from another list member, I suspect
> that there may be nothing original about the algorithm I came up with;
> nevertheless, here it is for anyone who would care to take a peek:
>
>  http://www.megginson.com/private/megginson-index-00.zip

Interesting. Ignoring the thresholding, I think your pre-test is roughly 
equivalent to edit distance over the strings considered as unordered 
sets of characters. A small string edit distance will imply a small 
unordered character set edit distance, so I guess the latter could be a 
useful witness to the former.

Cheers,


Miles

References:
- [OT] Looking for a text algorithm
  - From: David Megginson <david@megginson.com>
- Re: [xml-dev] [OT] Looking for a text algorithm
  - From: Miles Sabin <miles@milessabin.com>
- Re: [xml-dev] [OT] Looking for a text algorithm
  - From: David Megginson <david@megginson.com>

Prev by Date: Pointers to PPT? (was Re: [xml-dev] W3C Tech Plenary highlights)
Next by Date: Re: [xml-dev] Pointers to PPT? (was Re: [xml-dev] W3C Tech Plenary highlights)
Previous by thread: Re: [xml-dev] [OT] Looking for a text algorithm
Next by thread: RE: [xml-dev] [OT] Looking for a text algorithm
Index(es):
- Date
- Thread