Michael Champion wrote:
>We had a classically xml-devish thread back in October about the
>implications of Shannon's information theory for XML. I must say I
>didn't understand much of that thread, but Kurt Cagle has an
>intriguing entry in his weblog that
>puts forth some ideas that seem both interesting and somewhat
>questionable:
>"Entropy is important because it can better clarify the domain at
>which it is best to work with a given document. XQuery I think
>provides a good case in point here. XQuery supports XPath, and so it
>has some of the advantages that XSLT has, but it's not really all that
>useful for dealing with documents -- converting a DocBook document
>into WordML or vice versa would be impossible in XQuery, but for many
>business schemas with comparatively low entropies, XSLT is definitely [...]"
>
> I for one like the idea of his interpretation of the entropy of an
>XML document in terms of the number of discrete states that its
>(implicit or explicit?) schema allows. I also like the idea that
>certain tools are more or less appropriate depending on the entropy of
>the documents being processed -- perhaps it's something like SAX and
>DOM for low entropy, XQuery for medium entropy, and XSLT for high
>entropy (very document-ish) documents. I wonder, however, about the
>assertions made for the appropriateness of XQuery and XSLT, e.g.
>"converting a DocBook document into WordML or vice versa would be
>impossible in XQuery". It gets back into our XSLT vs XQuery
>permathread -- do the two have radically different capabilities with
>respect to handling recursive structures and/or recursive algorithms,
>or are they more or less different syntaxes for the same capabilities?
>Thoughts, anyone? Sorry to reopen the permathread, but I think
>Kurt's approach might lead to a more focused and possibly conclusive
>discussion, Maybe wwe can all can trade ideas about this with our
>relatives over the holidays :-)
i think cagle's interpretation is fundamentally wrong. (and i may be
slightly at odds with shannon too, but i don't think so).
the number of possible states is not the entropy of a particular
message. and in cagle's example of the two bit integer he is talking
about the limit of the number of possible states as entropy increases.
arguments about the smallest program etc aside, a message is a subset
of the possible states. and it's the message that has entropy, not the
possible states. if the message can be described simply it has low
entropy; if it can't be described simply it has high entropy - relative
to the total number of possible states. (remember that's how we're
trying to find the aliens - very low entropy must be simple physical
things, very high is just background noise, but something in between we
postulate must come from a living source).
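to make the distinction concrete, here's a toy python sketch (my own strings and numbers, nothing from cagle's post): per-symbol entropy estimated from a message's symbol frequencies, compared against the log2 ceiling fixed by the alphabet. the number of possible states only sets the ceiling; the entropy of any particular message can sit anywhere below it.

```python
import math
from collections import Counter

def empirical_entropy(message):
    """shannon entropy in bits per symbol, estimated from symbol frequencies."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# a simply-describable message: entropy well below the ceiling
regular = "aaaaaaaaaaaaaaab"
# a more varied message over a 27-symbol alphabet (letters plus space)
varied = "the quick brown fox jumps over the lazy dog"

print(empirical_entropy(regular))  # well under 1 bit/symbol
print(empirical_entropy(varied))   # higher, but still below...
print(math.log2(27))               # ...the ceiling set by the state space
```

one caveat: "can be described simply" is really kolmogorov complexity; frequency-based entropy is only a crude stand-in for it, but it shows the message/state-space distinction well enough.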
the entropy of a message is not a function of the complexity of a
schema, but of the information content (the CDATA section that cagle chose
to ignore) within the schema. the schema adds structure and therefore a
minimal amount of information. the cdata adds all the real information.
so the total number of possible states is determined not by the schema
but by the allowed content of the messages. if a schema allows any
number of repetitions of even one element, or the content of one element
to be unbounded then the total number of states of the document is
theoretically not limited. if everything is bounded then you can set an
upper limit on the number of states.
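the bounded case is easy to see in a toy sketch (a made-up "schema", not any real vocabulary): if every field has a finite set of allowed contents, the total state count is just the product, and log2 of that product bounds the entropy of any conforming message. make one field unbounded and both the product and the bound disappear.

```python
import math

# a toy, hypothetical schema: each field maps to its allowed contents
bounded_schema = {
    "currency": ["USD", "EUR", "AUD"],  # 3 possible contents
    "priority": list(range(10)),        # 10 possible contents
}

def state_count(schema):
    """total number of distinct documents the schema admits."""
    total = 1
    for allowed in schema.values():
        total *= len(allowed)
    return total

states = state_count(bounded_schema)
print(states)             # 30 distinct documents
print(math.log2(states))  # so no message can carry more than ~4.9 bits
```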
i suspect most schemas out there actually fall into the former category.
but as these are open systems we can inject some energy into the system
by way of improved schema design and reduce the number of total states
and therefore the entropy of the messages (by getting to the completely
bounded case).
all of this seems to me irrelevant to the idea of recursive (xslt) vs
non-recursive (xquery) processing.
however it's not irrelevant to a discussion of dom vs sax. dom
processing implies the ability to represent the whole document as a dom
before processing, which in turn implies the total document size must be
limited. sax on the other hand can keep processing the stream (forever,
in principle).
it also implies that an unbounded schema must be impossible to
validate in the general case.
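the memory argument shows up directly in python's stdlib sax parser (a sketch of the principle, not of any particular pipeline; the element name is made up): the handler below keeps a running count and retains nothing per element, so it would work just as well on an endless stream, whereas a dom parse of the same input has to build the whole tree first.

```python
import xml.sax

class ItemCounter(xml.sax.ContentHandler):
    """counts <item> elements; memory use stays flat because nothing
    is retained per element -- a dom would hold every node instead."""
    def __init__(self):
        super().__init__()
        self.items = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.items += 1

handler = ItemCounter()
doc = "<root>" + "<item/>" * 1000 + "</root>"
xml.sax.parseString(doc.encode(), handler)
print(handler.items)  # 1000
```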
and finally, for those who like playing with pipes, you could establish
network plumbing: input processes taking an infinite stream, converting
it to a schema, and passing it to sax processors that can consume
infinite messages, identifying bounded components and passing them off
to either other sax or dom based processes for further processing.
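a minimal sketch of that plumbing in python (the element names are invented): an iterparse loop plays the sax role on a stream of any length, and each completed bounded component is handed over as a small tree for dom-style processing, then freed.

```python
import io
import xml.etree.ElementTree as ET

def bounded_components(stream, tag):
    """stream an arbitrarily long document, yielding each completed
    <tag> subtree as a small in-memory tree, then freeing it."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            yield elem    # a bounded component, safe to process as a tree
            elem.clear()  # then drop it so memory stays flat

# a finite stand-in for the infinite input stream
doc = "<feed>" + "".join(f"<order><qty>{i}</qty></order>" for i in range(5)) + "</feed>"
quantities = [int(order.find("qty").text)
              for order in bounded_components(io.StringIO(doc), "order")]
print(quantities)  # [0, 1, 2, 3, 4]
```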