Now, consider the case of a document that fused multiple
means or channels of expression such that intent can only
be deduced by understanding the combined effect of the
languages. When you tell a joke, a deadpan delivery
can kill it or make it funnier, but either way it is the
delivery that makes the difference. That matters little to rows
of numbers in a spreadsheet but it is very important to a
negotiation or to a system attempting to analyse pre-act
intent. So while we do understand how the vector model
works for text analysis, do we understand how to apply
it to a *text* that includes video and audio as integral
parts of the *text*? And can we combine these into a
higher-level vector space term? (Intuitively, yes. A
tensor product? Dunno. Seems obvious, and it is referenced
in other papers on applying quantum logic to search
(Chavoustie, et al).)
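A minimal sketch of the tensor-product intuition, with made-up numbers and feature names (nothing here is from the cited papers): take a term-weight vector for the text channel and a feature vector for the audio channel, and form their outer product to get one vector in the combined, higher-dimensional space.

```python
# Illustrative only: the vectors and their dimensions are invented.
# A text term-weight vector and an audio feature vector (e.g. how
# "deadpan" the delivery is) for one multimedia "document".
text_vec = [0.7, 0.1, 0.2]
audio_vec = [0.9, 0.1]

# Tensor (outer) product: one component per (text-term, audio-feature)
# pair, i.e. a single vector in the product space.
joint = [t * a for t in text_vec for a in audio_vec]

print(len(joint))            # 6 == 3 * 2
print(round(joint[0], 2))    # 0.63 == 0.7 * 0.9
```

The point of the product space is that a component weights a text term *in combination with* a delivery feature, which is exactly what the joke example needs.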
A record in a multimedia database has multiple media.
We can mark these up given a vocabulary and out-of-line
markup. These are multi-namespace documents in the
abstract sense. Now, is it worth naming the comparison
as a topic in its own right? (A URI for common topics
could have utility, and perhaps the similarity metric
is WHAT it identifies.)
Thanks for the paper URL, Ken. Off to read that now.
I checked and it references Salton.
From: Ken North [mailto:email@example.com]
Steve DeRose wrote:
>> To use markup well for this, it seems like you have to know something
>> about its semantics -- which is hard, but maybe avoidable.
>> Maybe we can apply a similar Hidden Markov Model for documents and markup
>> On the other hand, what about a simpler approach to analyzing and using
>> markup: what if Google were to do nothing more than to allow you to search
>> your words/phrases *only* within particular element types?
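A toy sketch of what element-scoped search would mean; the document and element names below are invented examples, not anything Google actually exposes.

```python
import xml.etree.ElementTree as ET

# Find a phrase only inside a given element type.
doc = """<article>
  <title>Vector models for markup</title>
  <body>The vector model treats a document as a bag of words.</body>
</article>"""

def search_in_element(xml_text, tag, phrase):
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(tag)
            if el.text and phrase.lower() in el.text.lower()]

print(search_in_element(doc, "title", "vector"))   # ['Vector models for markup']
print(search_in_element(doc, "body", "markup"))    # []
```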
At the risk of being repetitive, I'll point again to Cohen's research on
WHIRL. (He's now at Carnegie Mellon.) Cohen's work takes the
approach of representing a document as a set of terms and computing textual
similarity:
"the term-weight representation of a document can be a surprisingly effective
model of its semantic content; in particular, documents with intuitively
similar semantic content often have similar representations."
Claude Bullard wrote:
>> Abstract topics regardless of the kind of expression used (eg, HTML vs
>> SVG) should have the same vector values.
Cohen used Salton's model with fragments of text represented by document
vectors. From Cohen's '99 WHIRL paper:
"One advantage of this "vector space" representation is that the similarity
of two documents can be easily computed."
The excerpt below is from the 1999 paper. In a more recent paper published in
the ACM Transactions on Information Systems, he wrote:
"Inferences made by WHIRL are also surprisingly accurate, equaling the
accuracy of hand-coded normalization routines on one benchmark problem, and
outperforming exact matching with a plausible global domain on a second."
Excerpts from the 1999 paper:
In this paper we describe WHIRL (for Word-based Heterogeneous Information
Representation Language), a new type of information system that
combines logic-based and text-based representation methods. With respect to
text, WHIRL adopts a key tool of modern text-based information systems: the
term-weight representation for text, in which a document is represented as a
vector of terms, each associated with a numeric weight indicating its relative
importance. (This is sometimes called a "bag of words" representation, since
terms usually correspond to words).
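The "bag of words" representation in miniature (my own toy example, not from the paper): map each term to a count and discard word order entirely.

```python
from collections import Counter

# A bag-of-words vector: term -> count, word order discarded.
def bag_of_words(text):
    return Counter(text.lower().split())

bow = bag_of_words("the cat sat on the mat")
print(bow["the"], bow["cat"])   # 2 1
```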
Term-based representations can be easily created and stored, and with
appropriate indices, many operations can be carried out very efficiently.
Another advantage of this representation is that with a good weighting
scheme, the term-weight
representation of a document can be a surprisingly effective model of its
semantic content; in particular, documents with intuitively similar semantic
content often have similar representations.
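One common weighting scheme of the kind the excerpt alludes to is TF-IDF. Exact formulas vary (I believe WHIRL uses a log-scaled TF-IDF variant); this is a minimal textbook version over a toy three-document corpus of my own invention.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the dog barked"]

# Weight = term frequency in the document, times log of
# (corpus size / number of documents containing the term).
def tfidf(doc, corpus):
    tf = Counter(doc.split())
    n = len(corpus)
    return {t: f * math.log(n / sum(1 for d in corpus if t in d.split()))
            for t, f in tf.items()}

w = tfidf("the cat sat", docs)
# "the" occurs in every document, so its weight is log(3/3) = 0;
# "cat" is rarest, so it gets the highest weight.
print(sorted(w, key=w.get, reverse=True))   # ['cat', 'sat', 'the']
```

This is what makes "documents with similar semantic content have similar representations" plausible: common words carry almost no weight, rare content words dominate.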
In WHIRL, this notion of similarity has been closely integrated with logical
deduction. WHIRL is a conventional logic (a subset of non-recursive Datalog)
that has been extended by introducing an atomic type for textual entities, and
an atomic operation for computing textual similarity.
The presence of the "soft" similarity predicate necessitates a "soft" semantics:
inferences in WHIRL are associated with numeric scores, and presented to the
user in decreasing order by score, much like the documents returned by a
ranked-retrieval IR system.
We will show that WHIRL strictly generalizes both IR ranked retrieval and
logical deduction; that non-trivial queries concerning large databases can be
answered efficiently; that WHIRL can be used to integrate data from
distributed, heterogeneous, information sources, such as those found on the
Web; that WHIRL can be used effectively for inductive classification of text; and
finally that WHIRL can be used to extract data from structured documents, and to
semi-automatically generate "wrappers" (extraction programs) for structured
documents.
The general idea behind the vector representation is that the magnitude of the
component vt is related to the "importance" of the term t in the document
represented by ...
One advantage of this "vector space" representation is that the similarity of
two documents can be easily computed.
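The "easily computed" similarity is the cosine of the angle between the two term vectors: the dot product divided by the product of the norms. A minimal sketch over bag-of-words counts (example sentences are mine):

```python
import math
from collections import Counter

# Cosine similarity of two bag-of-words vectors.
def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb)

print(round(cosine("the cat sat on the mat", "the cat sat"), 3))   # 0.816
print(round(cosine("the cat sat", "the cat sat"), 6))              # 1.0
```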