----- Original Message -----
From: "Bullard, Claude L (Len)" <len.bullard@intergraph.com>
To: "'TAN Kuan Hui'" <kuanhui@xemantics.com>; "Roger L. Costello"
<costello@mitre.org>; <xml-dev@lists.xml.org>
Sent: Tuesday, October 12, 2004 10:16 PM
Subject: RE: [xml-dev] [Shannon: information ~ uncertainty] Ramifications to
XML data exchange?
> Correct. Discussions of Markov models are appropriate.
>
Agreed; temporal relevancy is very useful. Entropy varies with
time as information fluctuates, which makes system modeling
interestingly complex and dynamic.
If Google factored datetime into its relevancy ranking in a
similar way, IMO it would definitely improve the information
value of a query. Keeping huge databases in sync with that
temporal relevancy, though, will be challenging.
So understanding Shannon theory w.r.t. XML data, and returning
interactions with that data with greater relevancy and
information value to the user, is a useful discussion.
Generic modelling of relevancy w.r.t. domain-specific vocabs
and XML schemas will also be interesting, i.e. semantic
prediction over domain vocabs for query purposes.
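To make the datetime-factoring idea concrete, here is a rough
Python sketch; the exponential decay and the 30-day half-life are
invented tuning knobs, not anything a real search engine publishes:

import math

def time_decayed_score(base_relevance, age_days, half_life_days=30.0):
    # Exponential decay: the temporal weight halves every
    # half_life_days, so stale documents sink in the ranking.
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    return base_relevance * decay

# A strong match from 60 days ago vs. a weaker match from yesterday:
print(time_decayed_score(0.9, 60))  # ~0.225
print(time_decayed_score(0.6, 1))   # ~0.586
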
>
> Analysis of letter frequencies is applied to text categorization
> and to other pattern-based analyses used for prediction. Imagine a
> tool that scans texts and, based on this analysis, creates a
> schematic description of the frequencies of occurrence of
> some set of categorical types. Would that output be close to,
> or equivalent to, a DTD or Schema? Is a DTD/Schema a pattern
> generated by a learning/negotiation process?
>
I think we are confusing the "structure and syntax of information"
with the "content and semantics of information"; IMO it is the
latter that dictates the value of the information. The former only
provides a structured set of vocabs for communication.
As an aside, this is why searching on stop words is not very
useful and why relevancy improves with (e.g.) phrase searches: the
longer the phrase, the lower its probability of occurrence, the
higher the relevancy of the query, and the higher the value of the
query. The less redundancy in an XML data set, therefore, the
higher its information value.
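In Shannon's terms, the value of a term is its self-information,
-log2(p): a near-certain term carries almost no information, a
rare phrase carries a lot. A quick Python illustration (the
occurrence probabilities here are invented for the example):

import math

def self_information(p):
    # Shannon self-information in bits: rarer events carry more.
    return -math.log2(p)

# Invented per-document occurrence probabilities in some corpus:
print(self_information(0.95))    # a stop word like "the"  -> ~0.07 bits
print(self_information(0.01))    # a content word          -> ~6.64 bits
print(self_information(0.0001))  # a long, specific phrase -> ~13.29 bits
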
> In the following, I will summarize from
>
> http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html
>
> The three principal applications of Markov modeling are:
>
> o Evaluation model: discovering the probability of an observable
>   sequence (apply the forward algorithm)
> o Decoding model: discovering the sequence of hidden states that
>   create the observable state (apply the Viterbi algorithm)
> o Learning model: given an observable state, discovering the
>   hidden states (apply the forward/backward algorithm)
>
> In fact, the majority of texts we exchange are not random, and
> all choices in the Shannon sense are not equally probable. They
> are 'meaningful'. Understanding how texts acquire the property
> of meaning implies understanding how multiple systems reduce or
> increase entropy when they interact, even when within each
> system some choices are equally probable (non-deterministic)
> and some are not (relative determinism).
>
Statistically, the interactions of multiple systems should
aggregate, and the ensemble can still be modelled as a single
black box.
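For reference, a minimal Python sketch of the forward algorithm
for the evaluation problem summarized above; the two-state model
and its probabilities are invented purely to exercise the code:

def forward(observations, states, start_p, trans_p, emit_p):
    # P(observation sequence) under an HMM: the 'evaluation'
    # problem, O(len(observations) * len(states)^2).
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states)
                    * emit_p[s][obs]
                 for s in states}
    return sum(alpha.values())

# Invented two-state model, just to run the function:
states = ("rainy", "sunny")
start_p = {"rainy": 0.6, "sunny": 0.4}
trans_p = {"rainy": {"rainy": 0.7, "sunny": 0.3},
           "sunny": {"rainy": 0.4, "sunny": 0.6}}
emit_p  = {"rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
           "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(forward(("walk", "shop", "clean"), states, start_p, trans_p, emit_p))
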
> Determinism varies system by system. The arrow of time does not
> in and of itself produce steady increases in entropy. Only
> thermodynamically isolated systems fit that model. A system
> interoperating with other systems and exchanging energy changes
> that outcome. A Markov model assumes we can predict a future
> state based on past states.
>
Temporal correlations are usually higher with near-term events
than with those that occurred some time ago. Entropy should
therefore be higher when evaluated with reference to recent data.
The XML Schema gives us a starting point in that model.
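As a rough sketch of that direction (the element names and the
unsmoothed counting are my own simplifications), one can tally
which child element follows which in observed instance documents;
the transition counts form a first-order Markov model of the
content model a DTD or Schema would declare:

import xml.etree.ElementTree as ET
from collections import defaultdict

def sibling_transitions(xml_text):
    # Count parent -> (child, next-child) transitions in one
    # document: a crude, statistical cousin of a content model.
    counts = defaultdict(int)
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        children = [child.tag for child in parent]
        for a, b in zip(children, children[1:]):
            counts[(parent.tag, a, b)] += 1
    return counts

doc = "<order><id/><item/><item/><total/></order>"
for key, n in sibling_transitions(doc).items():
    print(key, n)
# ('order', 'id', 'item') 1, ('order', 'item', 'item') 1,
# ('order', 'item', 'total') 1 -- suggesting id, item+, total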