David, you seem determined to play Bacchus, come to
demand a seat at the table of the gods by disruption.
It's an immature strategy.
Yes, VSM is a form of document indexing and classification.
It uses term frequency to create similarity metrics, typically
the cosine of the angle between term-frequency vectors, which
normalizes for document length. There are a LOT of freely
available papers you can find simply by entering "vector space
model" into that ever-loving simple box that does such a good job
for cases where SQL falls on its bum. No structure == no SQL.
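
For anyone who wants to see the geometry rather than read about it, here is a minimal sketch of the cosine measure. It assumes raw term counts and whitespace tokenization, both simplifications, and the function names (`term_vector`, `cosine_similarity`) are illustrative only, not taken from any particular library.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a text to raw term counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

short_doc = term_vector("xml markup for document structure")
# Same wording repeated twice: a longer document pointing in the same direction.
long_doc = term_vector("xml markup for document structure "
                       "xml markup for document structure")
other_doc = term_vector("relational tables and sql queries")

print(cosine_similarity(short_doc, long_doc))   # close to 1.0 despite the length difference
print(cosine_similarity(short_doc, other_doc))  # no shared terms, so 0.0
```

The first comparison is the point of the normalization: doubling the length of a document changes the magnitude of its vector but not its direction, so the angle stays the same.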
So one last time as clear as I can:
1. The problem of weakly structured (think RSS) or
unstructured (think notepad files) data is classification.
2. The problem of XML is that it requires a priori classification
that may result in weak structuring or high costs.
3. The problem of the publish/subscribe model is that it
invokes problems one and two automatically if a human does
not intervene. Notification based systems rely on triggers
because humans know where to put those. Humans are expensive
and make mistakes. Wyman is right: analyze the query.
This is the classic pattern identification problem. Regardless
of the database system you use, query analysis is required to
enable matching. In an unstructured or weakly structured world,
the information of interest is in the text nodes. It is like
having a message system that only contains two fields: call and response.
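
A rough sketch of what "analyze the query" could look like for publish/subscribe: treat each subscription as a small document and match incoming items against it by similarity, with no human placing triggers. Everything here is an assumption for illustration: the subscriber names, the sample texts, the `match_subscribers` helper, and especially the 0.3 threshold, which would have to be tuned on real data.

```python
import math
from collections import Counter

def vec(text):
    """Raw term counts for a text (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(n * b[t] for t, n in a.items() if t in b)
    na = math.sqrt(sum(n * n for n in a.values()))
    nb = math.sqrt(sum(n * n for n in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_subscribers(item_text, subscriptions, threshold=0.3):
    """Return the subscribers whose query is similar enough to the item.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    item = vec(item_text)
    return [name for name, query in subscriptions.items()
            if cosine(vec(query), item) >= threshold]

# Hypothetical subscriptions: free-text queries, no predefined categories.
subscriptions = {
    "alice": "xml schema validation",
    "bob": "football scores",
}
item = "new release improves xml schema validation speed"
matched = match_subscribers(item, subscriptions)
print(matched)
```

The item carries no classification of its own; the match comes entirely from the text, which is the situation with weakly tagged formats.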
Vector Space Models and others like them use term frequency to establish
similarity metrics. These metrics can be used to cluster documents
with similar content even in the face of polysemy and synonymy. These
are relatively old techniques and do require preprocessing, but HTML
was an even older technique, as was markup, before they were recognized
by the database community.
Again, the problem of XML is a priori classification. Just as HTML was a
leap backwards to make forward progress, the publish/subscribe methods,
particularly where based on weakly tagged message formats such as RSS,
require another look to the past to bring forward the worst/best of the
IR technologies, because the formats and models create exactly the same
problems. The database gurus of fifteen years ago did not believe markup
was a solution for database integration issues. The markup gurus of today
don't believe that geometry is a solution for pattern analysis. The past
is not always informative if the environment has changed; on the other
hand, a proven technique in a new environment can work better. HTML and
XML are the proof that, for the most part, the SGMLers were right and the
database experts were wrong.
A day in the library is worth a month in the lab.
len
From: David Lyon [mailto:david.lyon@computergrid.net]
ok, well I'm lost. Vectors are a simple mathematical
paradigm. How do they apply to xml? or is it just
a new type of marketing speak?