On Mon, Apr 25, 2005 at 03:49:35PM -0500, Bullard, Claude L (Len) wrote:
> So where we do understand how the vector model
> works for text analysis,
If you mean the cosine vector similarity model espoused by
the late Dr. Gerard Salton and others, I think what we know
is that it was an interesting theory that supported a lot of
useful research, but has a number of practical difficulties.
I don't know how Dr Cohen (cited earlier by Steve DeRose) has
dealt with them. Difficulties include the fact that humans
attribute significance (in English) to word order, and also
use collocation of terms to help with sense disambiguation.
Another difficulty with earlier systems like SMART was that
sufficiently large documents contained all the terms -- use of
markup to do term weighting for individual sections (or even
paragraphs) can be a significant win in some environments.
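(For concreteness, here's a rough sketch in Python of the sort of
thing I mean -- the element names and weights are of course made up,
and a real system would do proper tokenization:

    # Weight term occurrences by the element they appear in,
    # rather than treating the whole document as one bag of words.
    # Element names and weights here are invented for illustration.
    from collections import Counter
    import xml.etree.ElementTree as ET

    WEIGHTS = {"title": 5.0, "abstract": 3.0, "para": 1.0}

    def weighted_terms(xml_text):
        root = ET.fromstring(xml_text)
        counts = Counter()
        for elem in root.iter():
            weight = WEIGHTS.get(elem.tag)
            if weight is None or not elem.text:
                continue
            for term in elem.text.lower().split():
                counts[term] += weight
        return counts

so a term in a title counts for more than the same term buried in
the hundredth paragraph.)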
In the extract, Cohen mentions that term weighting can be
"surprisingly effective" and goes on to say that
> One advantage of this "vector space" representation is that the
> similarity of two documents can be easily computed.
Sometimes the thing that's easy to implement gets far enough
of the way that it doesn't seem worth implementing anything better.
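(And it really is easy -- the cosine measure is only a few lines,
e.g. as a sketch taking two term-to-weight mappings such as the
counts above:

    import math

    def cosine_similarity(a, b):
        # a and b map terms to weights (e.g. Counters as above)
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

which is part of its appeal, whatever its other limitations.)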
The use of fuzzy logic (is this a derivative of Zadeh?) is also
Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/