Salton's approach makes it easy to know when things are
similar. Then the human sorts out the noise. That is
fine for things that run at human speed. That is not
fine for things that run at machine speed and have
quick system-wide effects. The power laws of crap
feeding back to crap have not been suspended.
Take the vector measures and tie them together with
URIs across multiple notations for the same observations
and that is an interesting system for machine learning
as has been shown time and time again. They aren't
as useful for targeting munitions; they can be useful
for fusing multiple systems and giving a human a
short list, or better, a space of solutions, and that
is what we see from Google et al. The web works because
human smarts take up the slack for computer dumb.
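As a rough illustration of the "short list for a human" idea, here is a
minimal sketch of a Salton-style vector measure: plain term-frequency
vectors compared by cosine similarity, with the top few candidates handed
back for a person to vet. The function names (term_vector, cosine,
short_list), the whitespace tokenizer, and the sample addresses are all
assumptions made for the example; nothing here is taken from WHIRL, the
CF spider, or Google.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    # Counter returns 0 for missing terms, so unshared terms add nothing.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def short_list(query, candidates, k=3):
    """Return the k candidates most similar to the query, best first.

    The scores only rank candidates; a human still decides which,
    if any, is actually the right match.
    """
    q = term_vector(query)
    scored = [(cosine(q, term_vector(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical address strings, purely for illustration.
    records = ["100 main street", "100 maine street", "42 elm road"]
    for score, record in short_list("100 main street", records, k=2):
        print(f"{score:.2f}  {record}")
```

Note that the near-miss "100 maine street" still scores well. That is
the noise the human is expected to sort out, and it is why a ranking
like this is a short list and not an answer.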
Google is fine until you try to base an emergency dispatch
system on its addresses and maps. Three problems:
1. Locations can be off by half a mile or more.
2. Satellite photos are stale (by as much as 18 months)
and vary in resolution between adjacent areas less than
ten miles apart.
3. In the investigation that follows, one isn't allowed
to mix unvetted data with vetted data (by policy, the
name of the neighbor can't be entered without the neighbor
having a defined role in the event (e.g., a witness)).
Dumb things done with dumb data are fine until you need
something smart and accurate fast. Relaxing reliability
to get deployment scale does work. Ask any driver of a
T-34. Massed deployment always beats high-potential assets
in smaller numbers if you can sustain high initial casualty
rates.
len
From: Ken North [mailto:kennorth@sbcglobal.net]
Len Bullard wrote:
2) Where one can establish a similarity metric, is that good enough, as
Bosworth is claiming for human processes, for machine-processes?
Bosworth is playing fast and loose with the noise problems.
Cohen and Fan discuss the noise issue in the paper about the CF spider,
which uses a variant of the cosine distance measure of textual similarity
(used in WHIRL):
"However, although the data is noisy, it seems reasonable to believe metrics
based on it can be used for comparative purposes. We note also that CF
systems which can learn from this sort of noisy "observational" data (e.g.,
[Liebermann, 1995; Perkowitz & Etzioni, 1997]) are potentially far more
valuable than CF systems that require explicit noise-free ratings."
The solution to the semantic web might be millions of people creating
Atom/RSS, but I'm more optimistic about applying machine learning with
enough hardware. Google has already shown an array of processors can crunch
the web's content. If you embark on creating Google++ using technologies
such as WHIRL and the CF spider, you'll need a large array of hardware. But
as Bosworth noted in the PowerPoint presentation, hardware is cheap.