   RE: [xml-dev] Something altogether different?

At 11:58 -0500 2005-04-22, Bullard, Claude L (Len) wrote:
>1)  Can processes be reliable given noisy data?

Of course -- just have to read Claude Shannon. Though here's one interesting bit about noise I just ran into: At http://www.maths.ex.ac.uk/~mwatkins/zeta/surprising.htm

>Indirectly, as a result of studying nonlinear dynamics Marek Wolf discovered two instances of apparent fractality within the distribution of prime numbers ([W2-3]). These discoveries were realised experimentally using powerful computers. Wolf's resulting interest in the distribution of the primes led him to experimentally discover  the presence of 1/f [pink] noise when the  primes are treated as a 'signal' in the sense of information theory ([W4]). This is also a self-similar (scale invariant, or fractal) property of the distribution of primes.

Connecting noise/information theory to the Riemann hypothesis -- now there's "Something altogether different", especially disruptive because trapdoor encryption methods depend on our not knowing how to find prime factors fast enough.... Oops....

>I suggest a review of the works of Salton et al on
>the vector space model, and the new refinements of
>Dominik Kuropka et al on topic-based vector space
>models.  Consider these in terms of namespaces as
>provided by XML, and the implications given aggregate

I once spent a while working on the idea of incorporating markup into Salton-like metrics. The problem I ran into was that Salton's stuff works solely at the document level, so even the fact that two words are at opposite ends of the document (versus being adjacent, for example) doesn't enter in. So markup giving you finer distinctions of co-relevance wouldn't help. First you have to find how to apply Salton-ish methods to finer-grained objects, which is not trivial. There are a couple of papers on that, but last I looked, nothing very effective. To use markup well for this, it seems like you have to know something about its semantics -- which is hard, but maybe avoidable.
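
To make the document-level limitation concrete, here is a minimal sketch of a bag-of-words comparison. It is not Salton's actual weighting scheme: it uses raw term counts rather than tf-idf, and the function names and sample strings are made up for illustration. The point is only that moving a word from one end of a document to the other leaves the vector, and therefore the similarity score, unchanged.

```python
import math
from collections import Counter

def term_vector(text):
    """Document-level term-frequency vector; word positions are discarded."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Same words; "markup" is adjacent to "xml" in one and at the far end in the other.
adjacent = "xml markup helps search while other text fills the rest of the page"
far_apart = "xml helps search while other text fills the rest of the page markup"

same_vector = term_vector(adjacent) == term_vector(far_apart)
score = cosine(term_vector(adjacent), term_vector(far_apart))
```

Because the two vectors are identical, no metric computed from them can tell the documents apart, which is why element boundaries add nothing until the vectors are built over smaller units.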

In AI, a similar issue of accuracy vs. speed/simplicity/scalability in the face of noise and ambiguity was solved in the late 80's. Turns out, it's hard to assign the right part of speech to words. Almost everything is ambiguous (like "dog" can be a verb). Linguists had shown that there are cases where you *cannot* determine which way a word is functioning without knowing the whole semantics -- the reliability issue again. But getting the whole semantics is a lot of work, especially if you haven't figured out the part of speech yet. Ken Church and I showed in 1987 that you could get *better* reliability with purely statistical methods that ignored semantic questions. Yes, we got the "proof" cases wrong -- but we did better overall, and the method was practical (about O(ln N) instead of O(N**3), for any geeks among us). Now part-of-speech tagging is nearly always done that way.
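
For readers who have not seen a statistical tagger, the following is a toy sketch of the general idea, using Viterbi decoding over a Hidden Markov Model. It is not a reconstruction of the 1987 work: the tag set, the probability tables, and the function name are all invented for illustration, where a real tagger would estimate the numbers from a tagged corpus. It picks the most probable tag sequence from tag-to-tag and tag-to-word probabilities alone, with no semantics.

```python
# Hypothetical, hand-picked probabilities; a real tagger would estimate these from a corpus.
START = {"DET": 0.5, "PRON": 0.3, "NOUN": 0.15, "VERB": 0.05}
TRANS = {
    "DET":  {"DET": 0.025, "PRON": 0.025, "NOUN": 0.9, "VERB": 0.05},
    "PRON": {"DET": 0.05,  "PRON": 0.05,  "NOUN": 0.1, "VERB": 0.8},
    "NOUN": {"DET": 0.1,   "PRON": 0.1,   "NOUN": 0.2, "VERB": 0.6},
    "VERB": {"DET": 0.5,   "PRON": 0.15,  "NOUN": 0.3, "VERB": 0.05},
}
EMIT = {
    "DET":  {"the": 1.0},
    "PRON": {"they": 1.0},
    "NOUN": {"dog": 0.7, "fox": 0.3},
    "VERB": {"dog": 0.2, "barks": 0.4, "chase": 0.4},
}

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy model."""
    if not words:
        return []
    tags = list(START)
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (START[t] * EMIT[t].get(words[0], 0.0), [t]) for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            # Best previous tag to arrive from, judged by path probability times transition.
            prob, path = max(
                ((best[s][0] * TRANS[s][t], best[s][1]) for s in tags),
                key=lambda cand: cand[0],
            )
            step[t] = (prob * EMIT[t].get(word, 0.0), path + [t])
        best = step
    return max(best.values(), key=lambda cand: cand[0])[1]

noun_reading = viterbi("the dog barks".split())
verb_reading = viterbi("they dog the fox".split())
```

With these made-up numbers, "dog" comes out as a noun after "the" and as a verb after "they", purely because of which tag sequences are likelier.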

Maybe we can apply a similar Hidden Markov Model for documents and markup analysis? If I had a grant I'd have time to write out a solution, but unfortunately it won't fit in the margin of this email. :)

On the other hand, what about a simpler approach to analyzing and using markup: what if Google were to do nothing more than to allow you to search for your words/phrases *only* within particular element types? No knowledge of what the elements mean, maybe even no knowledge of what schema or namespace. Just use exactly the same code they use to support "site:" and other prefixes. All of a sudden you can do some amazing things with XML data, and you get some help with HTML, too. Yes, it's badly broken and inadequate in a bunch of ways -- very much like URIs, which are equally broken but have served admirably anyway. It would also motivate use of markup and markup standardization big-time.

Now *there's* something completely different. Not because it's hard or brilliant -- but because, like TimBL's original Web idea, it would simply ignore the really tough problem of solving semantics and the cases we know won't work right, and just get on with it. Which for some purposes (not missile-targeting, please!) is fine.


Luthien Consulting: Real solutions to hard information management problems
   Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@acm.org



Copyright 2001 XML.org. This site is hosted by OASIS