xml-dev - Re: [xml-dev] [OT] Looking for a text algorithm

To: David Megginson <david@megginson.com>
Subject: Re: [xml-dev] [OT] Looking for a text algorithm
From: Mitch Amiano <mamiano@nc.rr.com>
Date: Sat, 08 Mar 2003 15:08:53 -0500
Cc: XML Developers List <xml-dev@lists.xml.org>
In-reply-to: <15977.64423.65785.440495@megginson.com>
Organization: Software Adjuvant
References: <15977.64423.65785.440495@megginson.com>
Reply-to: mamiano@nc.rr.com
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.3b) Gecko/20030104

I don't know about the numeric approach to your problem, but it sounds
a lot like matching DNA strings while accounting for frameshift errors.
Dan Gusfield's book, "Algorithms on Strings, Trees, and Sequences", devotes
a lot of space to that and similar problems.

- Mitch

David Megginson wrote:
> I'm looking for references to a specific kind of text algorithm -- the
> algorithm should generate a number (say, 32 or 64 bits) for any text
> string of any length, similar to a hash.  However, it should be
> possible to compare the numbers for different strings to tell how
> close they are to each other.  For example, the numbers for
> 
> 1. To be or not to be.
> 
> 2. Two bees or not two bees.
> 
> 3. I don't know whether to be or not to be.
> 
> should indicate that the three strings are relatively close to each other
> (while a hash number would give no indication at all).
> 
> I'm asking only out of interest, because I came up with a simple
> algorithm to do this while I was in the shower yesterday, and it would
> be fun to see how close it is to what the pros use for spam detection
> and so on.
> 
> Note that I'm not looking for algorithms based on edit-distance,
> bag-of-words, and so on.
> 
> 
> Thanks in advance,
> 
> 
> David
>
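
For readers who want something concrete: the request above (a fixed-width
number per string, where close numbers suggest close strings) resembles what
is usually called a similarity-preserving or locality-sensitive fingerprint.
The sketch below uses SimHash, one known technique of that kind. It is not
David's algorithm, which the thread does not describe, and because it is
built from character trigrams it arguably leans toward the bag-of-features
family he said he was not after. The names simhash, hamming, and
FINGERPRINT_BITS, and the choice of trigrams and BLAKE2b, are assumptions
made purely for illustration.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width; the post suggests 32 or 64 bits


def _hash64(token: str) -> int:
    """Hash a token to a 64-bit integer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str, n: int = 3) -> int:
    """Return a 64-bit SimHash fingerprint built from character n-grams."""
    # Normalize case and whitespace so trivial differences do not matter.
    normalized = " ".join(text.lower().split())
    # Overlapping character n-grams; a very short text yields one shingle.
    shingles = [normalized[i:i + n]
                for i in range(max(1, len(normalized) - n + 1))]
    # Each shingle votes +1 or -1 on every bit position.
    votes = [0] * FINGERPRINT_BITS
    for shingle in shingles:
        h = _hash64(shingle)
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    fingerprint = 0
    for bit in range(FINGERPRINT_BITS):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    samples = [
        "To be or not to be.",
        "Two bees or not two bees.",
        "I don't know whether to be or not to be.",
    ]
    prints = [simhash(s) for s in samples]
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            print(i + 1, j + 1, hamming(prints[i], prints[j]))
```

Two fingerprints are compared by Hamming distance: the fewer bits that
differ, the more trigrams the two strings tend to share. With strings as
short as the three examples the signal is noisy, so the distances are a
rough hint rather than a reliable measure.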
