Steve DeRose wrote:
>> To use markup well for this, it seems like you have to know something about
>> its semantics -- which is hard, but maybe avoidable.
>> Maybe we can apply a similar Hidden Markov Model for documents and markup
>> On the other hand, what about a simpler approach to analyzing and using
>> markup: what if Google were to do nothing more than to allow you to search for
>> your words/phrases *only* within particular element types?
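To make the element-scoped search idea concrete, here is a minimal sketch in Python. It assumes well-formed XML and uses only the standard library; the function name search_in_elements and the sample document are hypothetical, made up purely for illustration, and say nothing about how a real search engine would index markup.

```python
import xml.etree.ElementTree as ET

def search_in_elements(xml_text, element_type, phrase):
    """Return the text of each element of the given type that contains the phrase."""
    root = ET.fromstring(xml_text)
    needle = phrase.lower()
    hits = []
    for elem in root.iter(element_type):
        # itertext() yields the text of the element and all of its descendants;
        # split/join collapses runs of whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if needle in text.lower():
            hits.append(text)
    return hits

# Hypothetical sample document, invented for this example.
SAMPLE = """<doc>
  <title>Markup and search</title>
  <para>Search engines mostly ignore markup.</para>
  <note>Markup could narrow a search.</note>
</doc>"""

if __name__ == "__main__":
    print(search_in_elements(SAMPLE, "para", "markup"))
    print(search_in_elements(SAMPLE, "title", "engines"))
```

The query word matches only when it occurs inside the requested element type, so "engines" is found in para but not in title.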
At the risk of being repetitive, I'll point again to Cohen's research at AT&T on
WHIRL. (He's now at Carnegie Mellon.) Cohen's work takes the
approach of representing a document as a set of terms and computing textual
similarity:
"the term-weight representation of a document can be a surprisingly effective
model of its semantic content; in particular, documents with intuitively similar
semantic content often have similar representations."
Claude Bullard wrote:
>> Abstract topics regardless of the kind of expression used (eg, HTML vs X3D or
>> SVG) should have the same vector values.
Cohen used Salton's model with fragments of text represented by document
vectors. From Cohen's '99 WHIRL paper:
"One advantage of this "vector space" representation is that the similarity of
two documents can be easily computed."
The excerpt below is from the 1999 paper. In a more recent paper published in
the ACM Transactions on Information Systems, he wrote:
"Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy
of hand-coded normalization routines on one benchmark problem, and outperforming
exact matching with a plausible global domain on a second."
Excerpts from the 1999 paper:
In this paper we describe WHIRL (for Word-based Heterogeneous Information
Representation Language), a new type of information system that synergistically
combines logic-based and text-based representation methods. With respect to
text, WHIRL adopts a key tool of modern text-based information systems: the
term-weight representation for text, in which a document is represented as a set
of terms, each associated with a numeric weight indicating its relative
importance. (This is sometimes called a "bag of words" representation, since
terms usually correspond to words).
Term-based representations can be easily created and stored, and with suitable
indices, many operations can be carried out very efficiently. Another advantage
of this representation is that with a good weighting scheme, the term-weight
representation of a document can be a surprisingly effective model of its
semantic content; in particular, documents with intuitively similar semantic
content often have similar representations.
In WHIRL, this notion of similarity has been closely integrated with logical
deduction. WHIRL is a conventional logic (a subset of non-recursive Datalog)
that has been extended by introducing an atomic type for textual entities, and
an atomic operation for computing textual similarity.
The presence of the "soft" similarity predicate necessitates a "soft" semantics;
inferences in WHIRL are associated with numeric scores, and presented to the
user in decreasing order by score, much like the documents returned by a
ranked-retrieval IR system.
We will show that WHIRL strictly generalizes both IR ranked retrieval and
logical deduction; that non-trivial queries concerning large databases can be
answered efficiently; that WHIRL can be used to integrate data from distinct,
distributed, heterogeneous, information sources, such as those found on the Web;
that WHIRL can be used effectively for inductive classification of text; and
finally that WHIRL can be used to extract data from structured documents, and to
semi-automatically generate "wrappers" (extraction programs) for structured
documents.
The general idea behind the vector representation is that the magnitude of the
component vt is related to the "importance" of the term t in the document
represented by ...
One advantage of this "vector space" representation is that the similarity of
two documents can be easily computed.
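
As a rough illustration of the term-weight ("vector space") representation and the similarity computation described in the excerpts, here is a small Python sketch. The weighting used, (1 + log TF) * log(N / DF), is one common TF-IDF variant chosen for the example; it is an assumption and not necessarily the exact scheme WHIRL uses. The function names and the sample strings are likewise hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a unit-length dict of term -> weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: (1 + log TF) * log(N / DF).
        # A term that occurs in every document gets weight 0.
        vec = {t: (1 + math.log(c)) * math.log(n_docs / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

if __name__ == "__main__":
    names = ["acme widget corp", "acme widget corporation", "zenith gadget inc"]
    v = tfidf_vectors(names)
    print(cosine(v[0], v[1]))  # shares two terms, so the score is positive
    print(cosine(v[0], v[2]))  # shares no terms
```

Because the vectors are normalized to unit length, the similarity is just a sum over shared terms, which is why it is cheap to compute and why rare shared terms count for more than common ones.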