   Re: [xml-dev] XML and entropy, again


Michael Champion wrote:

>We had a  classically xml-devish thread back in October about the
>implications of Shannon's information theory for XML.  I must say I
>didn't understand  much of that thread, but Kurt Cagle has an
>intriguing entry in his weblog
>http://metaphoricalweb.blogspot.com/2004/12/xml-and-entropy.html that
>puts forth some ideas that seem both interesting and somewhat
>practical.
>
>"Entropy is important because it can better clarify the domain at
>which it is best to work with a given document. XQuery I think
>provides a good case in point here. XQuery supports XPath, and so it
>has some of the advantages that XSLT has, but it's not really all that
>useful for dealing with documents -- converting a DocBook document
>into WordML or vice versa would be impossible in XQuery, but for many
>business schemas with comparatively low entropies, XSLT is definitely
>overkill."
>
> I for one like the idea of his interpretation of the entropy of an
>XML document in terms of the number of discrete states that its
>(implicit or explicit?) schema allows.  I also like the idea that
>certain tools are more or less appropriate depending on the entropy of
>the documents being processed -- perhaps it's something like SAX and
>DOM for low entropy, XQuery for medium entropy, and XSLT for high
>entropy (very document-ish) documents.   I wonder, however, about the
>assertions made for the appropriateness of XQuery and XSLT, e.g.
>"converting a DocBook document into WordML or vice versa would be
>impossible in XQuery".  It gets back into our XSLT vs XQuery
>permathread -- do the two have radically different capabilities with
>respect to handling recursive structures and/or recursive alorithms,
>or are they more or less different syntaxes for the same capabilities?
>
>Thoughts, anyone?   Sorry to reopen the permathread, but I think
>Kurt's approach might lead to a more focused and possibly conclusive
>discussion.  Maybe we can all trade ideas about this with our
>relatives over the holidays :-)
>
>  
>
i think cagle's interpretation is fundamentally wrong. (and i may be 
slightly at odds with shannon too, but i don't think so).

the number of possible states is not the entropy of a particular 
message. in cagle's example of the two bit integer he is talking about 
the limit on the number of possible states as entropy increases. 
arguments about the smallest program etc aside, a message is a 
selection from the possible states, and it's the message that has 
entropy, not the possible states. if the message can be described 
simply it has low entropy; if it can't be described simply it has high 
entropy - relative to the total number of possible states. (remember, 
that's how we're trying to find the aliens - very low entropy must be 
simple physical processes, very high entropy is just background noise, 
but something in between, we postulate, must come from a living source).
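
to make the distinction concrete, here's a rough python sketch (purely 
my own illustration - shannon entropy estimated from symbol 
frequencies): two 16-character messages over the same alphabet have 
exactly the same number of possible states, but very different 
entropies.

import math
from collections import Counter

def message_entropy(msg: str) -> float:
    """shannon entropy (bits per symbol) estimated from symbol frequencies."""
    counts = Counter(msg)
    n = len(msg)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

low  = "aaaaaaaaaaaaaaab"    # easy to describe -> low entropy (~0.34 bits/symbol)
high = "qdkzmxprwvtnfghs"    # no pattern, 16 distinct letters -> ~4 bits/symbol

print(message_entropy(low))
print(message_entropy(high))
# both are 16 characters over the same lowercase alphabet, so the number
# of *possible* states is identical; only the messages differ in entropy.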

the entropy of a message is not a function of the complexity of a 
schema but of the information content within it (the cdata - the 
character data that cagle chose to ignore). the schema adds structure 
and therefore only a minimal amount of information; the character data 
carries all the real information.

so the total number of possible states is determined not by the schema 
alone but by the allowed content of the messages. if a schema allows 
any number of repetitions of even one element, or allows the content of 
one element to be unbounded, then the total number of states of the 
document is theoretically unlimited. if everything is bounded then you 
can set an upper limit on the number of states.
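
as a back-of-the-envelope illustration (my own made-up numbers, not any 
real schema): if every element has a finite set of allowed values and a 
bounded number of occurrences, the state count is just a product and 
you can put a figure on it; let any one element repeat without bound 
and the product diverges.

import math

def bounded_states(elements):
    """elements: list of (allowed_values, max_occurs) pairs."""
    total = 1
    for allowed_values, max_occurs in elements:
        # 0..max_occurs occurrences, each taking one of the allowed values
        per_element = sum(allowed_values ** k for k in range(max_occurs + 1))
        total *= per_element
    return total

# a made-up, fully bounded "purchase order"-ish schema
schema = [
    (10, 1),     # status: 10 codes, at most one occurrence
    (1000, 5),   # item: 1000 part numbers, at most 5 occurrences
    (2, 1),      # rush flag
]
states = bounded_states(schema)
print(states)                # finite, if astronomically large
print(math.log2(states))     # upper bound on message entropy, in bits

# if max_occurs for any element is unbounded, the sum above has no limit
# and there is no finite state count (or entropy bound) at all.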

i suspect most schemas out there actually fall into the unbounded 
category. but as these are open systems we can inject some energy into 
them by way of improved schema design, reducing the total number of 
states and therefore the entropy of the messages (by getting to a 
completely bounded state).

all of this seems to me irrelevant to the idea of recursive (xslt) vs 
non-recursive (xquery) processing.

however it's not irrelevant to a discussion of dom vs sax. dom 
processing implies the ability to build the whole document tree before 
processing, which in turn implies the total document size must be 
bounded. sax, on the other hand, can keep processing the stream 
(forever if necessary).
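
for example, with python's standard xml.sax module (just one way to 
illustrate it) a handler reacts to events as they arrive, so the input 
can be a pipe that never ends; a dom parse would have to materialise 
the whole tree first.

import sys
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # called per event as the stream arrives; nothing is buffered
        self.counts[name] = self.counts.get(name, 0) + 1

# parse from stdin so the input can be a pipe (the counts only print
# if/when the stream actually ends)
handler = ElementCounter()
xml.sax.parse(sys.stdin.buffer, handler)
print(handler.counts)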

it also implies that an unbounded schema must be impossible to validate 
in the general case.

and finally, for those who like playing with pipes, you could establish 
network plumbing: input processes taking an infinite stream, converting 
it to a schema, and passing it to sax processors that can consume 
infinite messages, identifying bounded components and passing them off 
to either other sax- or dom-based processes for further processing.
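
a rough sketch of that plumbing (using python's ElementTree iterparse 
here rather than a literal sax-to-dom bridge, and a made-up <record> 
element as the bounded component): the outer loop streams indefinitely, 
each completed record becomes a small in-memory tree for dom-ish 
processing, then gets thrown away.

import sys
import xml.etree.ElementTree as ET

def process_record(record):
    # stand-in for any dom-style processing of the bounded component
    print(record.findtext("id"), len(record))

for event, elem in ET.iterparse(sys.stdin.buffer, events=("end",)):
    if elem.tag == "record":
        process_record(elem)   # the complete subtree is available here
        elem.clear()           # discard it so memory stays bounded
        # (for a truly endless stream you'd also drop the reference the
        # root keeps to this element)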

rick

