Title: RE: [xml-dev] Something altogether different?
A cool million if you can prove the Riemann hypothesis, and the eternal displeasure of some cryptogeeks, but you definitely get your own personal footnote in the math books.
Yes, the system can be reliable in the face of noise if it is affordable, and that is cost vs. requirement. Speed is money; how fast can you afford to go?
I don't dispute that what Bosworth is talking about will work. Weaken the measurements and you can fit the moon inside a bag... theoretically.
I would suspect that the markup-search-inside approach yields results the way searching values inside rows does. It's grouped a priori. Also, doesn't the fact of a GI in the context of other GIs reduce ambiguity (ummm... sure)? Cosmic d'oh.
BTW: Kuropka goes beyond Salton by not assuming independent terms.
I see what you mean about co-relevance and the document level. It is the generality of a URI namespace as a topical name, and the fact that some names occur closer together or at higher frequencies than others, thus clustering among the vectors, that caught my attention.
If topics are vector spaces, and topics are grouped, a tensor product can be used to group the vectors. All qubits are vectors, and tensors can be used to group these.
Abstract topics, regardless of the kind of expression used (e.g., HTML vs. X3D or SVG), should have the same vector values. The vector product is another kind of address, as we discussed in the HyTime era.
Useful? I can't tell. It's intuitively appealing. If a schema circumscribes the topicality of a document, it is a tensor product of qubits. My math is too deficient to get past the intuition.
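For what it's worth, the tensor-product idea is easy to play with numerically. A minimal sketch in plain Python (the topic vectors below are invented for illustration, not taken from any real model):

```python
# Toy sketch of the tensor-product idea: the joint "address" of two
# topic vectors is their outer (Kronecker) product.

def tensor(a, b):
    """Tensor product of two vectors, returned as one flat vector."""
    return [x * y for x in a for y in b]

# Two made-up 3-dimensional topic vectors (e.g. term weights).
graphics = [0.8, 0.6, 0.0]
markup = [0.0, 0.6, 0.8]

joint = tensor(graphics, markup)
print(len(joint))   # 3 * 3 = 9 components in the product space

# A product state of two qubits works the same way: |0>|1> is the
# tensor product of the component state vectors.
zero, one = [1.0, 0.0], [0.0, 1.0]
print(tensor(zero, one))   # [0.0, 1.0, 0.0, 0.0]
```

The point is only that "grouping" two topic spaces lands you in a higher-dimensional product space, exactly as with multi-qubit states.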
Awfully glad you are back, Steve.
len
At 11:58 -0500 2005-04-22, Bullard, Claude L (Len) wrote:
> 1) Can processes be reliable given noisy data
Of course -- just have to read Claude Shannon. Though here's one interesting bit about noise I just ran into, at
http://www.maths.ex.ac.uk/~mwatkins/zeta/surprising.htm:
> Indirectly, as a result of studying nonlinear dynamics Marek Wolf discovered two instances of apparent fractality within the distribution of prime numbers ([W23]). These discoveries were realised experimentally using powerful computers. Wolf's resulting interest in the distribution of the primes led him to experimentally discover the presence of 1/f [pink] noise when the primes are treated as a 'signal' in the sense of information theory ([W4]). This is also a self-similar (scale-invariant, or fractal) property of the distribution of primes.
Connecting noise/information theory to the Riemann hypothesis -- now there's "Something altogether different", especially disruptive because trapdoor encryption methods depend on our not knowing how to find prime factors fast enough.... Oops....
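If you want to poke at the "primes as a signal" idea yourself, here is a minimal sketch: build the 0/1 prime-indicator sequence and compute a few Fourier magnitudes by hand. This is only a toy (a naive DFT over 512 points; Wolf's 1/f observation needs vastly more data):

```python
# Toy illustration: treat the primes as a 0/1 signal and look at a
# few DFT coefficient magnitudes.  Not a serious spectral estimate.
import cmath
import math

N = 512

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, math.isqrt(n) + 1))

signal = [1.0 if is_prime(n) else 0.0 for n in range(N)]

def dft_mag(x, k):
    """Magnitude of the k-th DFT coefficient of x (naive, O(N) per k)."""
    return abs(sum(v * cmath.exp(-2j * cmath.pi * k * n / len(x))
                   for n, v in enumerate(x)))

# Wolf's claim concerns how these magnitudes fall off with frequency
# (roughly 1/f) when computed over much longer stretches of primes.
for k in (1, 4, 16, 64):
    print(k, round(dft_mag(signal, k), 2))
```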
...
> I suggest a review of the works of Salton et al on the vector space
> model, and the new refinements of Dominik Kuropka et al on topic-based
> vector space models. Consider these in terms of namespaces as provided
> by XML, and the implications given aggregate
...
I once spent a while working on the idea of incorporating markup into Salton-like metrics. The problem I ran into was that Salton's stuff works solely at document level, so even the fact that two words are merely at opposite ends of the document (versus being adjacent, for example) doesn't enter in. So markup giving you finer distinctions of co-relevance wouldn't help. First you have to find how to apply Salton-ish methods to finer-grained objects, which is not trivial. There are a couple of papers on that, but last I looked, nothing very effective. To use markup well for this, it seems like you have to know something about its semantics -- which is hard, but maybe avoidable.
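For anyone who hasn't met Salton's model: a bag-of-words cosine similarity in a few lines, which also makes the document-level blindness concrete (the sample "documents" are invented):

```python
# Salton-style cosine similarity between two bags of words.
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity between two documents as term-count vectors."""
    v1, v2 = Counter(d1.split()), Counter(d2.split())
    dot = sum(v1[t] * v2[t] for t in v1)   # Counter returns 0 if absent
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm

# Word *order* never enters the model: these two score a perfect 1.0,
# which is exactly the document-level limitation at issue.
print(cosine("markup helps search", "search helps markup"))  # 1.0
```

Positions, adjacency, and any markup context are discarded the moment the document becomes a term-count vector.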
In AI, a similar issue of accuracy vs. speed/simplicity/scalability in the face of noise and ambiguity was solved in the late '80s. It turns out it's hard to assign the right part of speech to words. Almost everything is ambiguous (like "dog" can be a verb). Linguists had shown that there are cases where you *cannot* determine which way a word is functioning without knowing the whole semantics -- the reliability issue again. But getting the whole semantics is a lot of work, especially if you haven't figured out the part of speech yet. Ken Church and I showed in 1987 that you could get *better* reliability with purely statistical methods that ignored semantic questions. Yes, we got the "proof" cases wrong -- but we did better overall, and the method was practical (about O(ln N) instead of O(N**3), for any geeks among us). Now part-of-speech tagging is nearly always done that way.
Maybe we can apply a similar Hidden Markov Model to document and markup analysis? If I had a grant I'd have time to write out a solution, but unfortunately it won't fit in the margin of this email. :)
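To make the HMM idea concrete, here is a toy Viterbi decoder in the spirit of those statistical taggers. Every probability below is invented for illustration; a real tagger would train them from a corpus:

```python
# Toy Viterbi decoding for a two-tag HMM.  All numbers are made up.

def viterbi(words, tags, start, trans, emit):
    """Most likely tag sequence for `words` under a simple HMM."""
    # best[t] = (probability, path) of the best path ending in tag t
    best = {t: (start[t] * emit[t].get(words[0], 1e-6), [t]) for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((p * trans[prev][t] * emit[t].get(w, 1e-6), path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda x: x[0])[1]

tags = ["NOUN", "VERB"]
start = {"NOUN": 0.7, "VERB": 0.3}
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.7, "VERB": 0.3}}
emit = {"NOUN": {"dogs": 0.5, "dog": 0.4},
        "VERB": {"dog": 0.3, "bark": 0.6}}

# "dog" is ambiguous, but context resolves it statistically:
print(viterbi(["dogs", "dog"], tags, start, trans, emit))
# ['NOUN', 'VERB']
```

No semantics anywhere, just transition and emission probabilities -- which is the whole point of the 1987 result.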
On the other hand, what about a simpler approach to analyzing and using markup: what if Google were to do nothing more than allow you to search for your words/phrases *only* within particular element types? No knowledge of what the elements mean, maybe even no knowledge of what schema or namespace. Just use exactly the same code they use to support "site:" and other prefixes. All of a sudden you can do some amazing things with XML data, and you get some help with HTML, too. Yes, it's badly broken and inadequate in a bunch of ways -- very much like URIs, which are equally broken but have served admirably anyway. It would also motivate use of markup and markup standardization big-time.
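Element-scoped search is trivial to prototype locally, which is part of its appeal. A sketch using Python's standard ElementTree (the sample document and element names are invented):

```python
# Sketch of element-scoped search: match a phrase only inside a given
# element type, with no knowledge of schema or semantics.
import xml.etree.ElementTree as ET

doc = """<article>
  <title>Vector spaces for markup</title>
  <author>A. Nonymous</author>
  <abstract>Search only inside the markup you care about.</abstract>
</article>"""

def search_in(root, gi, phrase):
    """True if `phrase` occurs in the text content of any <gi> element."""
    return any(phrase in "".join(el.itertext()) for el in root.iter(gi))

root = ET.fromstring(doc)
print(search_in(root, "title", "markup"))   # True
print(search_in(root, "author", "markup"))  # False
```

Note that the GI is treated as an opaque string, just as "site:" treats a hostname -- no schema, no namespace awareness, no semantics.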
Now *there's* something completely different. Not because it's hard or brilliant -- but because, like TimBL's original Web idea, it would simply ignore the really tough problem of solving semantics and the cases we know won't work right, and just get on with it. Which for some purposes (not missile-targeting, please!) is fine.
Steve

Luthien Consulting: Real solutions to hard information management problems
Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@acm.org
