
   RE: [xml-dev] Something altogether different?


Right.  

Now, consider the case of a document that fuses multiple 
means or channels of expression such that intent can only 
be deduced by understanding the combined effect of the 
languages.  When you tell a joke, a deadpan expression 
can make it deadly or funnier, and it is the deadpan 
expression that makes the difference.  That isn't as 
important to rows of numbers in a spreadsheet, but it is 
very important to a negotiation or to a system attempting 
to analyse pre-act intent.  So while we do understand how 
the vector model works for text analysis, do we understand 
how to apply it to a *text* that includes video and audio 
as integral parts of the *text*, and can we combine these 
into a higher-level vector space term?  Intuitively, yes.  
A tensor product?  Dunno.  It seems obvious, and it is 
referenced in other papers on applying quantum logic to 
search (Chavoustie, et al).

A record in a multimedia database has multiple media. 
We can mark these up given a vocabulary and out-of-line 
markup.  These are multi-namespace documents in the 
abstract sense.  Now, is it worth naming the comparison 
as a topic in its own right?  A URI for common topics 
could have utility, and perhaps the similarity metric 
is WHAT it identifies.

Thanks for the paper URL, Ken.  Off to read that now. 
I checked and it references Salton.

len


From: Ken North [mailto:kennorth@sbcglobal.net]

Steve DeRose wrote:
>> To use markup well for this, it seems like you have to know something
>> about its semantics -- which is hard, but maybe avoidable.
>> Maybe we can apply a similar Hidden Markov Model for documents and
>> markup analysis?
>> On the other hand, what about a simpler approach to analyzing and using
>> markup: what if Google were to do nothing more than to allow you to
>> search for your words/phrases *only* within particular element types?
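
[A minimal sketch of the element-restricted search Steve describes, not
anything Google or WHIRL provides. The function name search_in_elements,
the sample document, and its element names are all hypothetical; matching
is a plain case-insensitive substring test, chosen only to keep the
example short.]

```python
import xml.etree.ElementTree as ET


def search_in_elements(xml_text, phrase, element_types):
    """Return (tag, text) pairs for elements of the given types whose
    text content contains the phrase (case-insensitive)."""
    root = ET.fromstring(xml_text)
    wanted = set(element_types)
    needle = phrase.lower()
    hits = []
    for elem in root.iter():
        if elem.tag not in wanted:
            continue  # ignore matches outside the requested element types
        text = "".join(elem.itertext()).strip()
        if needle in text.lower():
            hits.append((elem.tag, text))
    return hits


# Hypothetical sample document: the word "vector" occurs in all three
# child elements, but the query is scoped to <title> only.
SAMPLE = """<article>
  <title>Vector models for markup</title>
  <para>A vector model can scope a search by element.</para>
  <caption>Figure: a term vector</caption>
</article>"""

title_hits = search_in_elements(SAMPLE, "vector", ["title"])
all_hits = search_in_elements(SAMPLE, "vector", ["title", "para", "caption"])
```

[Scoping by element type needs no knowledge of what the markup means,
only of its names, which is the appeal of the simpler approach.]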

At the risk of being repetitive, I'll point again to Cohen's research at
AT&T on WHIRL. (He's now at Carnegie Mellon.)  Cohen's work takes the
approach of representing a document as a set of terms and computing textual
similarity.

"the term-weight representation of a document can be a surprisingly
effective model of its semantic content; in particular, documents with
intuitively similar semantic content often have similar representations."

Claude Bullard wrote:
>> Abstract topics regardless of the kind of expression used (eg, HTML vs
>> X3D or SVG) should have the same vector values.

Cohen used Salton's model with fragments of text represented by document
vectors. From Cohen's '99 WHIRL paper:

"One advantage of this "vector space" representation is that the similarity
of two documents can be easily computed."

The excerpt below is from the 1999 paper. In a more recent paper published
in the ACM Transactions on Information Systems, he wrote:
http://www-2.cs.cmu.edu/~wcohen/postscript/tois-whirl.pdf

"Inferences made by WHIRL are also surprisingly accurate, equaling the
accuracy of hand-coded normalization routines on one benchmark problem, and
outperforming exact matching with a plausible global domain on a second."

Excerpts from the 1999 paper:
--------------------------------
In this paper we describe WHIRL (for Word-based Heterogeneous Information
Representation Language), a new type of information system that
synergistically combines logic-based and text-based representation methods.
With respect to text, WHIRL adopts a key tool of modern text-based
information systems: the term-weight representation for text, in which a
document is represented as a set of terms, each associated with a numeric
weight indicating its relative importance. (This is sometimes called a
"bag of words" representation, since terms usually correspond to words).

Term-based representations can be easily created and stored, and with
suitable indices, many operations can be carried out very efficiently.
Another advantage of this representation is that with a good weighting
scheme, the term-weight representation of a document can be a surprisingly
effective model of its semantic content; in particular, documents with
intuitively similar semantic content often have similar representations.
...
In WHIRL, this notion of similarity has been closely integrated with logical
deduction. WHIRL is a conventional logic (a subset of non-recursive Datalog)
that has been extended by introducing an atomic type for textual entities,
and an atomic operation for computing textual similarity.
The presence of the "soft" similarity predicate necessitates a "soft"
semantics; inferences in WHIRL are associated with numeric scores, and
presented to the user in decreasing order by score, much like the documents
returned by a ranked-retrieval IR system.
...
We will show that WHIRL strictly generalizes both IR ranked retrieval and
logical deduction; that non-trivial queries concerning large databases can
be answered efficiently; that WHIRL can be used to integrate data from
distinct, distributed, heterogeneous, information sources, such as those
found on the Web; that WHIRL can be used effectively for inductive
classification of text; and finally that WHIRL can be used to extract data
from structured documents, and to semi-automatically generate "wrappers"
(extraction programs) for structured documents.
...
The general idea behind the vector representation is that the magnitude of
the component vt is related to the "importance" of the term t in the
document represented by ...
One advantage of this "vector space" representation is that the similarity
of two documents can be easily computed.
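
[To make the excerpts concrete, here is a minimal sketch of term-weight
vectors, cosine similarity, and a score-ranked "soft" join in the spirit of
what the paper describes. It is not Cohen's implementation. The weighting
formula is one common TF-IDF variant picked as an assumption, and the names
tokenize, tfidf_vectors, cosine, soft_join, and the sample company names
are all hypothetical.]

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on anything that is not alphanumeric."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def tfidf_vectors(docs):
    """Map each document to a unit-length {term: weight} vector.
    Assumed weighting: (1 + log tf) * log(1 + N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each term once per doc
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (1 + math.log(c)) * math.log(1 + n / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def soft_join(left, right, top_k=3):
    """Pair every left text with every right text and return the
    non-zero matches in decreasing order of similarity score."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    scored = [(cosine(a, b), x, y)
              for x, a in zip(left, lv)
              for y, b in zip(right, rv)]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


left = ["Acme Software Corporation", "General Widgets Inc."]
right = ["ACME Software Corp.", "Widgets General", "Blue Sky Airlines"]
results = soft_join(left, right)
for score, a, b in results:
    print(f"{score:.3f}  {a}  ~  {b}")
```

[No exact string match exists between the two lists, yet the shared
weighted terms are enough to rank the plausible pairs first, which is the
kind of inference without hand-coded normalization that the TOIS quote
above reports.]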




 


Copyright 2001 XML.org. This site is hosted by OASIS