Michael Champion wrote:
> I dunno ... I can't say this vision appeals to me, but I can see the
> momentum for RSS and microformats converging to produce this kind of
> thing more easily than I can envision the Semantic Web.
Resurrecting a topic from five years ago:
Domain vocabularies and search engines such as WHIRL have a lot of potential for
moving us toward the Semantic Web.
WHIRL implemented a measure of textual similarity that permitted similarity
searching. The measure used in WHIRL (1998) was also used for the collaborative
filtering (CF) spider described in this paper:
Web-Collaborative Filtering: Recommending Music by Crawling The Web
"We show that it is possible to collect data that is useful for collaborative
filtering (CF) using an autonomous Web spider. In CF, entities are recommended
to a new user based on the stated preferences of other, similar users. We
describe a CF spider that collects from the Web lists of semantically related
entities. These lists can then be used by existing CF algorithms by encoding
them as "pseudo-users". Importantly, the spider can collect useful data without
pre-programmed knowledge about the format of particular pages or particular
sites. Instead, the CF spider uses commercial Web-search engines to find pages
likely to contain lists in the domain of interest, and then applies
previously-proposed heuristics [Cohen, 1999] to extract lists from these pages.
We show that data collected by this spider is nearly as effective for CF as data
collected from real users, and more effective than data collected by two
plausible hand-programmed spiders. In some cases, autonomously spidered data can
also be combined with actual user data to improve performance."
Len Bullard wrote:
> 2) Where one can establish a similarity metric, is that good enough, as
> Bosworth is claiming for human processes, for machine-processes?
Bosworth is playing fast and loose with the noise problems.
Cohen and Fan discuss the noise issue in the paper about the CF spider, which
uses a variant of the cosine measure of textual similarity (the same measure
used in WHIRL):
"However, although the data is noisy, it seems reasonable to believe metrics
based on it can be used for comparative purposes. We note also that CF systems
which can learn from this sort of noisy "observational" data (e.g., [Liebermann,
1995; Perkowitz & Etzioni, 1997]) are potentially far more valuable than CF
systems that require explicit noise-free ratings."
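For the similarity measure itself: WHIRL-style soft matching compares strings as TF-IDF-weighted token vectors under the cosine measure, so rare tokens dominate the score and near-duplicate entity names match despite surface differences. The sketch below is illustrative only (the corpus and helper names are my own), not WHIRL's implementation.

```python
# Illustrative WHIRL-style soft match: strings compared as
# TF-IDF-weighted token vectors under the cosine measure.
import math
from collections import Counter

corpus = ["The Rolling Stones", "Rolling Stones", "The Who", "The Beatles"]

def tokens(s):
    return s.lower().split()

# Document frequency of each token across the corpus
df = Counter(t for s in corpus for t in set(tokens(s)))
N = len(corpus)

def tfidf_vector(s):
    """Smoothed TF-IDF weights for the tokens of one string."""
    tf = Counter(tokens(s))
    return {t: tf[t] * math.log((N + 1) / (df.get(t, 0) + 1)) for t in tf}

def cosine_sim(a, b):
    va, vb = tfidf_vector(a), tfidf_vector(b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Rare tokens ("rolling", "stones") carry most of the weight, while the
# common token "the" contributes little, so near-duplicates score high:
print(cosine_sim("The Rolling Stones", "Rolling Stones"))  # ~0.96
print(cosine_sim("The Rolling Stones", "The Who"))         # ~0.07
```

This weighting is also why the spidered data stays usable despite noise: spurious overlaps tend to involve common, low-weight tokens, so they contribute little to the similarity score.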
The solution to the Semantic Web might be millions of people creating Atom/RSS,
but I'm more optimistic about applying machine learning with enough hardware.
Google has already shown that an array of processors can crunch the Web's
content. If you embark on creating Google++ using technologies such as WHIRL
and the CF spider, you'll need a large array of hardware. But as Bosworth noted
in his PowerPoint presentation, hardware is cheap.