Miles Sabin writes:
> Judging from your examples it looks like you're after a closeness
> criterion derived from longest common subsequences. But I don't see how
> you could use that to usefully construct a single characteristic number
> for _any_ string of _any_ length: with only 32 or 64 bits to play with,
> many many completely unrelated (on any criterion) strings will collide
> on the same code.
Quite right. However, if the alternative is a linear search (say,
using an edit-distance algorithm), then reducing the number of
candidates by a few orders of magnitude would not necessarily be a bad
thing.
The problem I was considering (in the shower) was detecting spam
messages with minor variations, such as the insertion of the
recipient's e-mail address in the body or the substitution of Zimbabwe
for Nigeria. Assume that I have a database containing many millions
of known spam messages, and that I want to check an incoming e-mail
message against it. If I can narrow the field down to, say, 50
candidates after a very inexpensive operation, then my system will be
much more efficient; I can then use edit-distance against the closest
matches to see if the message really is likely spam.
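To make the two-stage idea concrete, here is a rough Python sketch.
It is not the algorithm in the zip linked below; the fingerprint based
on hashed word trigrams, the SpamIndex class, and its thresholds are
just assumptions I've made up for the illustration.

    from collections import defaultdict

    def fingerprint(text, k=3, buckets=64):
        """Map a message to a small set of bucket numbers derived from
        hashed word k-grams; near-duplicate messages share most buckets."""
        words = text.lower().split()
        shingles = [' '.join(words[i:i + k])
                    for i in range(len(words) - k + 1)]
        return {hash(s) % buckets for s in shingles}

    def edit_distance(a, b):
        """Plain dynamic-programming Levenshtein distance, word by word."""
        a, b = a.split(), b.split()
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            cur = [i]
            for j, wb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (wa != wb)))    # substitution
            prev = cur
        return prev[-1]

    class SpamIndex:
        """Toy index mapping fingerprint buckets to message ids,
        used for the inexpensive first pass."""
        def __init__(self):
            self.messages = []
            self.index = defaultdict(set)

        def add(self, text):
            msg_id = len(self.messages)
            self.messages.append(text)
            for b in fingerprint(text):
                self.index[b].add(msg_id)

        def candidates(self, text, min_shared=4):
            """Cheap pass: messages sharing at least min_shared buckets."""
            counts = defaultdict(int)
            for b in fingerprint(text):
                for msg_id in self.index[b]:
                    counts[msg_id] += 1
            return [m for m, c in counts.items() if c >= min_shared]

        def looks_like_spam(self, text, max_distance=10):
            """Expensive pass: edit distance against the few candidates."""
            return any(edit_distance(text, self.messages[m]) <= max_distance
                       for m in self.candidates(text))

The point is simply that the cheap first pass cuts millions of stored
messages down to a handful, so the comparatively expensive
edit-distance check only runs a few dozen times per incoming message.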
That said, based on private e-mail from another list member, I suspect
that there may be nothing original about the algorithm I came up with;
nevertheless, here it is for anyone who would care to take a peek:
http://www.megginson.com/private/megginson-index-00.zip
All the best,
David
--
David Megginson, david@megginson.com, http://www.megginson.com/