
   RE: [xml-dev] Something altogether different?


Salton's approach makes it easy to know when things are 
similar.  Then the human sorts out the noise.  That is 
fine for things that run at human speed.  That is not 
fine for things that run at machine speed and have 
quick system-wide effects.  The power laws of crap 
feeding back to crap have not been suspended.

Take the vector measures and tie them together with 
URIs across multiple notations for the same observations 
and that is an interesting system for machine learning 
as has been shown time and time again.  They aren't 
as useful for targeting munitions; they can be useful 
for fusing multiple systems and giving a human a 
short list, or better, a space of solutions, and that 
is what we see from Google et al.  The web works because 
human smarts take up the slack for computer dumb. 

Google is fine until you try to dispatch an emergency 
system based on its addresses and maps.  Three problems:

1.  Locations can be off by half a mile or more.

2.  Satellite photos are stale (by as much as 18 months) 
and their resolution varies between adjacent areas 
less than ten miles apart.

3.  In the investigation that follows, one isn't allowed 
to mix unvetted data with vetted data.  By policy, the 
name of the neighbor can't be entered unless the neighbor 
has a defined role in the event (e.g., a witness).

Dumb things done with dumb data are fine until you need 
something smart and accurate fast.  Relaxing reliability 
to get deployment scale does work.  Ask any driver of a 
T-34.  Massed deployment always beats high potential assets 
in smaller numbers if you can sustain high initial casualty 
rates.

len


From: Ken North [mailto:kennorth@sbcglobal.net]

Len Bullard wrote:
2)  Where one can establish a similarity metric, is that good enough, as
Bosworth is claiming for human processes, for machine-processes?
Bosworth is playing fast and loose with the noise problems.

Cohen and Fan discuss the noise issue in the paper about the CF spider, which
uses a variant of the cosine distance measure of textual similarity (used in
WHIRL):
"However, although the data is noisy, it seems reasonable to believe metrics
based on it can be used for comparative purposes. We note also that CF systems
which can learn from this sort of noisy "observational" data (e.g.,
[Liebermann, 1995; Perkowitz & Etzioni, 1997]) are potentially far more
valuable than CF systems that require explicit noise-free ratings."
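
For readers who have not met the cosine measure mentioned above, here is a
minimal sketch in Python. It is not the WHIRL or CF spider implementation:
it assumes plain whitespace tokenization and raw term counts, where WHIRL
uses TF-IDF weighting, and the names cosine_similarity and short_list are
made up for this illustration. It scores candidate strings against a query
and returns a ranked short list for a human to sort out.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; real systems stem and weight terms.
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between the term-count vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the terms the two strings share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string is similar to nothing
    return dot / (norm_a * norm_b)


def short_list(query, candidates, k=3):
    """Return the k candidates most similar to the query as (score, text) pairs."""
    scored = [(cosine_similarity(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    addresses = ["120 main st", "120 main street apt 4", "oak avenue"]
    for score, text in short_list("120 main street", addresses, k=2):
        print(f"{score:.3f}  {text}")
```

Note that the score only ranks candidates. Nothing in it says whether the
top hit is the right address, which is the noise problem under discussion.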

The solution to the semantic web might be millions of people creating
Atom/RSS, but I'm more optimistic about applying machine learning with enough
hardware. Google has already shown an array of processors can crunch the web's
content. If you embark on creating Google++ using technologies such as WHIRL
and the CF spider, you'll need a large array of hardware. But as Bosworth
noted in the PowerPoint presentation, hardware is cheap.




 
