OASIS Mailing List Archives





   RE: [xml-dev] More on Vector Models


David, you seem determined to play Bacchus, come to 
demand a seat at the table of the gods by disruption.  
It's an immature strategy.  

Yes, VSM is a form of document indexing and classification. 
It uses term frequency to create similarity metrics, typically 
the cosine of the angle between term vectors, which normalizes 
for document length.  There are a LOT of freely available papers 
you can find simply by entering "vector space model" into 
that ever-loving simple box that does such a good job for 
cases where SQL falls on its bum.  No structure == no SQL.
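
For the record, here is a minimal sketch of that cosine measure in Python. 
It is an illustration under simple assumptions, not a reference 
implementation: the tokenizer (lowercase, letters only), the function names 
term_vector and cosine, and the sample strings are all made up for the 
example, and real systems usually add stop-word removal, stemming and 
term weighting.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count how often each term occurs (assumed tokenizer: lowercase letters only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc1 = term_vector("xml markup for database integration")
doc2 = term_vector("database integration with xml markup and more xml markup")
doc3 = term_vector("recipes and picnic ideas")

print(cosine(doc1, doc2))  # shared vocabulary, so well above zero
print(cosine(doc1, doc3))  # no shared terms, so zero
```

Because the dot product is divided by both vector lengths, a document 
and the same document repeated twice score as identical.  That is the 
normalization: only the direction of the term vector matters, not how 
long the text is.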

So one last time as clear as I can:

1.  The problem of weakly structured (think RSS) or 
unstructured (think notepad files) data is classification. 

2.  The problem of XML is it requires a priori classification 
that may result in weak structuring or high costs.

3.  The problem of the publish/subscribe model is that it 
invokes problems one and two automatically if a human does 
not intervene.  Notification based systems rely on triggers 
because humans know where to put those.  Humans are expensive 
and make mistakes.  Wyman is right: analyze the query. 

This is the classic pattern identification problem.  Regardless 
of the database system you use, query analysis is required to 
enable matching.  In an unstructured or weakly structured world, 
the information of interest is in the text nodes.  It is like 
having a message system that only contains two fields: call and response.
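
To make the matching step concrete, here is a rough sketch in Python of 
scoring a subscriber's query against the text of weakly structured items.  
Everything in it is an assumption made for illustration: the 
(title, body) item shape, the names terms, similarity and match, the 
sample items, and the 0.2 threshold.  It stands in for whatever query 
analysis a real publish/subscribe system would do.

```python
import math
import re
from collections import Counter


def terms(text):
    """Term-frequency vector for a piece of text (assumed tokenizer: lowercase letters only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match(query, items, threshold=0.2):
    """Return (score, title) pairs at or above the threshold, best match first."""
    q = terms(query)
    scored = [(similarity(q, terms(title + " " + body)), title) for title, body in items]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Hypothetical items with only a title and a body, as in a weakly tagged feed.
items = [
    ("Vector models", "term frequency and cosine similarity for document clustering"),
    ("Lunch", "the cafeteria menu for this week"),
    ("Indexing", "document indexing with term frequency"),
]

for score, title in match("term frequency document similarity", items):
    print(f"{score:.2f}  {title}")
```

No tags beyond title and body are consulted; the match comes entirely 
from the text, which is the point when the structure tells you nothing.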

Vector Space Models and others like them use term frequency to establish 
similarity metrics.  These metrics can be used to cluster documents 
with similar content even in the face of polysemy and synonymy.  These 
are relatively old techniques and do require preprocessing, but HTML 
was an even older technique, as was markup generally, before either was 
recognized by the database community.  

Again, the problem of XML is a priori classification.  Just as HTML was a 
leap backwards to make forward progress, the publish/subscribe methods, 
particularly where based on weakly tagged message formats such as RSS, 
require another leap to the past to bring forward the worst/best of the IR 
technologies, because formats and models create exactly the same problems.  
The database gurus of years ago did not believe markup was a solution for 
database integration.
The markup gurus of today don't believe that geometry is a solution for 
pattern analysis.  The past is not always informative if the environment 
has changed; on the other hand, a proven technique in a new environment 
can work better.  HTML and XML are the proof that for the most part, 
the SGMLers were right and the database experts were wrong.

A day in the library is worth a month in the lab.


From: David Lyon [mailto:david.lyon@computergrid.net]

ok, well I'm lost. Vectors are a simple mathematical
paradigm. How do they apply to xml? or is it just
a new type of marketing speak?



Copyright 2001 XML.org. This site is hosted by OASIS