Title: RE: [xml-dev] Something altogether different?
million if you can prove the Riemann hypothesis, and the eternal displeasure of
cryptogeeks, but you definitely get your own personal footnote in the math
books. The system can be reliable in the face of noise if it is affordable,
and that is cost vs. requirement. Speed is money; how fast can you afford to go?
I don't dispute that what Bosworth is talking about will work. Weaken the
measurements and you can fit the moon inside a bag... theoretically.
I would suspect that the markup-search-inside approach yields results the way
searching inside rows does. It's grouped a priori. Also, doesn't the fact of a
GI in the context of other GIs reduce ambiguity (ummm... sure)? Cosmic
BTW: Kuropka goes beyond Salton by not assuming independent terms. I see
what you mean about co-relevance and the document level.
It is the generality of a URI namespace as a topical name, and the fact that
some names occur closer together or at higher frequencies than other names,
thus clustering among the vectors, that caught my attention.
If topics are vector spaces, and topics are grouped, a tensor product can be
used to group the vectors. All qubits are vectors, and tensors can be used to
group qubits. Abstract topics, regardless of the kind of expression used (e.g.,
HTML vs X3D or SVG), should have the same vector values. The vector product is
another kind of address, as we discussed.

Useful? I can't tell. It's intuitively appealing. If a schema circumscribes
the topicality of a document, it is a tensor product of qubits. My math is too
deficient to get past the intuition.
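For the curious, the outer-product intuition can be sketched in a few lines of Python. This is my sketch of the idea, not an established method, and the topic weights and axis labels are invented for illustration:

```python
# Sketch of "tensor product as address" for grouped topic vectors.
# All weights below are hypothetical, chosen only to make the shape visible.

def tensor_product(u, v):
    """Outer (tensor) product of two topic vectors: a len(u) x len(v) grid
    whose (i, j) entry couples weight u[i] with weight v[j]."""
    return [[ui * vj for vj in v] for ui in u]

# Topic weights for one document along two groupings (two topic axes).
schema_topics = [0.9, 0.1]     # e.g. weights for ("graphics", "text")
namespace_topics = [0.7, 0.3]  # e.g. weights for ("svg", "html")

joint = tensor_product(schema_topics, namespace_topics)
# joint[i][j] is one "address" in the combined topic space; two expressions
# of the same abstract topic (HTML vs SVG) would ideally land near the
# same cell.
```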
Awfully glad you are back, Steve.
At 11:58 -0500 2005-04-22, Bullard, Claude L (Len) wrote:
1) Can processes be reliable given noise?

Of course -- you just have to read Claude Shannon. Though here's one
interesting bit about noise I just ran into:
> Indirectly, as a result of studying nonlinear dynamics Marek Wolf
> discovered two instances of apparent fractality within the distribution of
> prime numbers ([W2-3]). These discoveries were realised experimentally using
> powerful computers. Wolf's resulting interest in the distribution of the
> primes led him to experimentally discover the presence of 1/f [pink]
> noise when the primes are treated as a 'signal' in the sense of
> information theory ([W4]). This is also a self-similar (scale invariant, or
> fractal) property of the distribution of primes.
Connecting noise/information theory to the Riemann hypothesis -- now
there's "Something altogether different", especially disruptive because
trapdoor encryption methods depend on our not knowing how to find prime
factors fast enough.... Oops....
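The experiment the quoted passage describes can be roughly reconstructed in stdlib Python. This is my sketch, not Wolf's code, and a few hundred samples only hint at the trend he reported over long runs:

```python
# Treat the primes as a 0/1 "signal" and look at its power spectrum.
import cmath

def primes_signal(n):
    """Indicator signal: s[k] = 1 if k is prime, else 0, for k in 0..n-1."""
    sieve = [True] * n
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [1 if b else 0 for b in sieve]

def power_spectrum(signal):
    """Naive DFT power spectrum (O(N^2); fine for a small demo)."""
    n = len(signal)
    out = []
    for f in range(1, n // 2):  # skip the DC term
        s = sum(x * cmath.exp(-2j * cmath.pi * f * k / n)
                for k, x in enumerate(signal))
        out.append(abs(s) ** 2)
    return out

spec = power_spectrum(primes_signal(512))
# Wolf's claim: averaged over long runs, the power falls off roughly like
# 1/f (pink noise). A 512-sample demo only gestures at that.
```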
I suggest a review of the works of Salton et al. on the vector space model,
and the new refinements of Kuropka et al. on topic-based vector space models.
Consider these in terms of namespaces as provided by XML, and the
implications given
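For anyone who hasn't met Salton's model: documents become term-frequency vectors and are compared by cosine similarity. A toy sketch (raw term counts here; real systems use tf-idf weighting and much larger vocabularies):

```python
# Minimal vector space model: sparse term-frequency vectors + cosine.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = Counter("the prime numbers form a noisy signal".split())
d2 = Counter("prime numbers and pink noise".split())
d3 = Counter("xml schema design".split())

print(cosine(d1, d2) > cosine(d1, d3))  # True: d1 is closer to d2 than to d3
```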
I once spent a while working on the idea of incorporating markup into
Salton-like metrics. The problem I ran into was that Salton's stuff works
solely at the document level, so even the fact that two words are merely at
opposite ends of the document (versus being adjacent, for example) doesn't
enter in. So markup giving you finer distinctions of co-relevance wouldn't
help. First you have to find how to apply Salton-ish methods to finer-grained
objects, which is not trivial. There are a couple papers on that, but last I
looked, nothing very effective. To use markup well for this, it seems like you
have to know something about its semantics -- which is hard, but maybe
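One naive way to get finer-grained objects -- my guess at how a first cut might look, not an established method -- is to treat each element's text as its own mini-document and build a vector per element:

```python
# Element-scoped term vectors: co-relevance bounded by markup rather than
# by the whole document. Element names and text are invented for the demo.
import xml.etree.ElementTree as ET
from collections import Counter

doc = ET.fromstring("""
<article>
  <abstract>prime numbers behave like a noisy signal</abstract>
  <appendix>xml schema design notes</appendix>
</article>""")

element_vectors = {
    el.tag: Counter((el.text or "").split())
    for el in doc
}
# "prime" and "noisy" now co-occur only inside <abstract>; at whole-document
# granularity they would also appear to co-occur with "schema".
print(element_vectors["abstract"]["prime"])   # 1
print(element_vectors["appendix"]["prime"])   # 0
```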
In AI, a similar issue of accuracy vs. speed/simplicity/scalability in
the face of noise and ambiguity was solved in the late 80's. Turns out, it's
hard to assign the right part of speech to words. Almost everything is
ambiguous (like "dog" can be a verb). Linguists had shown that there are cases
where you *cannot* determine which way a word is functioning without knowing
the whole semantics -- the reliability issue again. But getting the whole
semantics is a lot of work, especially if you haven't figured out the part of
speech yet. Ken Church and I showed in 1987 that you could get *better*
reliability with purely statistical methods that ignored semantic questions.
Yes, we got the "proof" cases wrong -- but we did better overall, and the
method was practical (about O(ln N) instead of O(N**3), for any geeks among
us). Now part-of-speech is nearly always done that way.
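For concreteness, here is a toy HMM tagger decoded with Viterbi. Every probability below is invented for illustration; the taggers described above estimated them from corpus counts:

```python
# Tiny hidden Markov model part-of-speech tagger, decoded with Viterbi.

def viterbi(words, tags, start, trans, emit):
    """Most likely tag sequence for `words` under the given HMM."""
    # v[t] = best probability of any path ending in tag t at this position.
    v = {t: start[t] * emit[t].get(words[0], 1e-6) for t in tags}
    backptrs = []
    for w in words[1:]:
        new_v, bp = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: v[p] * trans[p][t])
            new_v[t] = v[prev] * trans[prev][t] * emit[t].get(w, 1e-6)
            bp[t] = prev
        v = new_v
        backptrs.append(bp)
    # Walk the back-pointers from the best final tag.
    best = max(tags, key=lambda t: v[t])
    path = [best]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    path.reverse()
    return path

tags = ["NOUN", "VERB", "DET"]
start = {"NOUN": 0.3, "VERB": 0.1, "DET": 0.6}
trans = {"NOUN": {"NOUN": 0.2, "VERB": 0.7, "DET": 0.1},
         "VERB": {"NOUN": 0.3, "VERB": 0.1, "DET": 0.6},
         "DET":  {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05}}
emit = {"NOUN": {"dog": 0.6, "walk": 0.3},
        "VERB": {"dog": 0.1, "walk": 0.6},
        "DET":  {"the": 0.9}}

# "dog" and "walk" are both ambiguous; context picks the tag.
print(viterbi(["the", "dog"], tags, start, trans, emit))   # ['DET', 'NOUN']
print(viterbi(["dog", "walk"], tags, start, trans, emit))  # ['NOUN', 'VERB']
```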
Maybe we can apply a similar Hidden Markov Model for documents and markup
analysis? If I had a grant I'd have time to write out a solution, but
unfortunately it won't fit in the margin of this email. :)
On the other hand, what about a simpler approach to analyzing and using
markup: what if Google were to do nothing more than to allow you to search for
your words/phrases *only* within particular element types? No knowledge of
what the elements mean, maybe even no knowledge of what schema or namespace.
Just use exactly the same code they use to support "site:" and other prefixes.
All of a sudden you can do some amazing things with XML data, and you get some
help with HTML, too. Yes, it's badly broken and inadequate in a bunch of ways
-- very much like URIs, which are equally broken but have served admirably
anyway. It would also motivate use of markup and markup standardization.
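A stdlib-only sketch of element-restricted search; the function and its behavior are my invention, by analogy with "site:", not anything Google offers:

```python
# Search for a phrase only inside elements of a given type.
import xml.etree.ElementTree as ET

def search(xml_text, element_type, phrase):
    """True if `phrase` occurs in the text content of any <element_type>."""
    root = ET.fromstring(xml_text)
    for el in root.iter(element_type):
        if phrase in "".join(el.itertext()):
            return True
    return False

doc = """<doc>
  <title>Pink noise in the primes</title>
  <footnote>see Wolf, 1997</footnote>
</doc>"""

print(search(doc, "title", "noise"))     # True
print(search(doc, "footnote", "noise"))  # False
```

No schema, no namespace handling, no semantics -- just "these words, inside this element type", which is the whole point.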
Now *there's* something completely different. Not because it's hard or
brilliant -- but because, like TimBL's original Web idea, it would simply
ignore the really tough problem of solving semantics and the cases we know
won't work right, and just get on with it. Which for some purposes (not
missile-targeting, please!) is fine.
Luthien Consulting: Real solutions to hard information management problems
Specializing in XML, schema design, XSLT, and project management
Steven J. DeRose, Ph.D.