Title: RE: [xml-dev] Something altogether different?
At 11:58 -0500 2005-04-22, Bullard, Claude L (Len) wrote:
> 1) Can processes be reliable given noisy data?
Of course -- just have to read Claude Shannon. Though here's one
interesting bit about noise I just ran into: At
http://www.maths.ex.ac.uk/~mwatkins/zeta/surprising.htm
>Indirectly, as a result of studying nonlinear dynamics Marek
Wolf discovered two instances of apparent fractality within the
distribution of prime numbers ([W2-3]). These discoveries were
realised experimentally using powerful computers. Wolf's resulting
interest in the distribution of the primes led him to experimentally
discover the presence of 1/f [pink] noise when the primes
are treated as a 'signal' in the sense of information theory ([W4]).
This is also a self-similar (scale invariant, or fractal) property of
the distribution of primes.
Connecting noise/information theory to the Riemann hypothesis --
now there's "Something altogether different", especially
disruptive because trapdoor encryption methods depend on our not
knowing how to find prime factors fast enough.... Oops....
...
> I suggest a review of the works of Salton et al on
> the vector space model, and the new refinements of
> Dominick Kuropka et al on topic-based vector space
> models. Consider these in terms of namespaces as
> provided by XML, and the implications given aggregate
...
I once spent a while working on the idea of incorporating markup
into Salton-like metrics. The problem I ran into was that Salton's
stuff is working solely at document-level, so even the fact that two
words are merely at opposite ends of the document (versus being
adjacent, for example) doesn't enter in. So markup giving you finer
distinctions of co-relevance wouldn't help. First you have to find how
to apply Salton-ish methods to finer-grained objects, which is not
trivial. There are a couple papers on that, but last I looked, nothing
very effective. To use markup well for this, it seems like you have to
know something about its semantics -- which is hard, but maybe
avoidable.
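
To make the document-level limitation concrete, here is a minimal sketch of a
bag-of-words cosine score in the general spirit of the vector space model. It
is an illustration under assumptions, not Salton's actual weighting scheme:
the names term_vector and cosine and the sample strings are invented, and raw
term counts stand in for any real term weighting.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts; word order and position are thrown away."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Same words in both documents; only the positions differ (invented samples).
adjacent = "markup semantics matter while other words fill the rest"
far_apart = "markup words fill the rest while other matter semantics"
query = "markup semantics"

score_adjacent = cosine(term_vector(query), term_vector(adjacent))
score_far_apart = cosine(term_vector(query), term_vector(far_apart))
```

The two scores come out the same, because the vectors never record whether
the query words were adjacent or at opposite ends of the document.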
In AI, a similar issue of accuracy vs.
speed/simplicity/scalability in the face of noise and ambiguity was
solved in the late 80's. Turns out, it's hard to assign the right part
of speech to words. Almost everything is ambiguous (like "dog"
can be a verb). Linguists had shown that there are cases where you
*cannot* determine which way a word is functioning without knowing the
whole semantics -- the reliability issue again. But getting the whole
semantics is a lot of work, especially if you haven't figured out the
part of speech yet. Ken Church and I showed in 1987 that you could get
*better* reliability with purely statistical methods that ignored
semantic questions. Yes, we got the "proof" cases wrong --
but we did better overall, and the method was practical (about O(ln N)
instead of O(N**3), for any geeks among us). Now part-of-speech is
nearly always done that way.
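
For readers who have not seen statistical tagging, the following is a rough
sketch of a generic bigram Hidden Markov Model tagger with Viterbi decoding.
It is not a reconstruction of the 1987 work: the tag set, the word list, and
every probability below are invented for illustration, and the FLOOR value
for unseen words is an arbitrary assumption.

```python
import math

# Toy, hand-invented numbers for illustration only (not from any corpus).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {  # TRANS[previous_tag][tag]
    "DET": {"DET": 0.01, "NOUN": 0.9, "VERB": 0.09},
    "NOUN": {"DET": 0.2, "NOUN": 0.2, "VERB": 0.6},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {  # EMIT[tag][word]; "dog" and "man" are ambiguous on purpose
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "problems": 0.3, "man": 0.2},
    "VERB": {"dog": 0.2, "barks": 0.6, "man": 0.2},
}
FLOOR = 1e-6  # assumed probability for a word never seen with a tag

def viterbi(words):
    """Return the most probable tag sequence under the toy model."""
    # Each column maps tag -> (best log-probability so far, previous tag).
    columns = [{
        t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], FLOOR)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        last = columns[-1]
        column = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: last[p][0] + math.log(TRANS[p][t]))
            score = (last[prev][0] + math.log(TRANS[prev][t])
                     + math.log(EMIT[t].get(word, FLOOR)))
            column[t] = (score, prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

noun_reading = viterbi("the dog barks".split())
verb_reading = viterbi("problems dog the man".split())
```

With these made-up numbers, "dog" is tagged NOUN in the first sentence and
VERB in the second, using nothing but neighboring tags and word frequencies.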
Maybe we can apply a similar Hidden Markov Model for documents
and markup analysis? If I had a grant I'd have time to write out a
solution, but unfortunately it won't fit in the margin of this email.
:)
On the other hand, what about a simpler approach to analyzing and
using markup: what if Google were to do nothing more than to allow you
to search for your words/phrases *only* within particular element
types? No knowledge of what the elements mean, maybe even no knowledge
of what schema or namespace. Just use exactly the same code they use
to support "site:" and other prefixes. All of a sudden you
can do some amazing things with XML data, and you get some help with
HTML, too. Yes, it's badly broken and inadequate in a bunch of ways --
very much like URIs, which are equally broken but have served
admirably anyway. It would also motivate use of markup and markup
standardization big-time.
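
A minimal sketch of what such element-scoped search could look like, using
Python's standard xml.etree.ElementTree. Everything specific here is an
assumption made for illustration: the "element:word" query syntax (modeled
loosely on "site:"), the function names, and the two sample documents. It
says nothing about how any real search engine is built.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]

def build_index(docs):
    """Map (element type, word) -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for elem in ET.fromstring(xml_text).iter():
            # All text inside the element, descendants included; no
            # knowledge of what the element means, or of its schema.
            for word in "".join(elem.itertext()).lower().split():
                index[(local_name(elem.tag), word)].add(doc_id)
    return index

def search(index, query):
    """Answer a hypothetical 'element:word' query, e.g. 'title:noise'."""
    element, word = query.split(":", 1)
    return sorted(index.get((element, word.lower()), set()))

# Invented sample documents; the second uses a namespace, which is ignored.
DOCS = {
    "a.xml": "<article><title>Noise and primes</title>"
             "<body>Shannon on reliable channels</body></article>",
    "b.xml": "<article xmlns='http://example.org/ns'>"
             "<title>Shannon revisited</title>"
             "<body>More about noise</body></article>",
}
INDEX = build_index(DOCS)
title_hits = search(INDEX, "title:noise")
body_hits = search(INDEX, "body:noise")
```

Here "title:noise" finds only a.xml and "body:noise" finds only b.xml, even
though both documents contain the word.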
Now *there's* something completely different. Not because it's
hard or brilliant -- but because, like TimBL's original Web idea, it
would simply ignore the really tough problem of solving semantics and
the cases we know won't work right, and just get on with it. Which for
some purposes (not missile-targeting, please!) is fine.
Steve
--
Luthien Consulting: Real solutions to hard information management problems
Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@acm.org