Hi,
Nicolas LEHUEN wrote:
> I don't think readability alone is a sufficient reason to forbid binary
> content from appearing in an XML document.
I agree: there is a much better and simpler reason to forbid binary
content from XML documents ;=) ...
>
> What defines the set of allowed characters in XML content ? Is it technical
> reasons, or readability reasons ?
IMO neither, but rather a fundamental design decision: an XML
entity is Unicode text (possibly stored in another encoding) and not a
stream of bytes.
This should be a sufficient reason to close the debate IMO!
The problem with including arbitrary binary content would not so much be
the "control characters" as the fact that the physical value of this
content, read as bytes, would change depending on the encoding used for
the document (what if I save it as UTF-16 when it was created as
UTF-8?).
We are using a layered model where XML is built on Unicode, and raw
binary content would short-circuit the lower layer...
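To make the encoding point concrete, here is a minimal Python sketch. The sample string and byte values are my own choices for illustration: the same characters produce different bytes under UTF-8 and UTF-16, and an arbitrary byte sequence need not be decodable as text at all.

```python
# Sketch: the same Unicode text yields different bytes under different encodings.
text = "caf\u00e9"  # four characters: c, a, f, e-acute (sample chosen for illustration)

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte order mark

# Same characters, different physical bytes.
assert utf8_bytes != utf16_bytes

# Decoding each with its own encoding gives back identical text.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == text

# Arbitrary binary data is not text: the byte 0xFF never occurs in valid UTF-8.
raw = b"\x00\xff\xfe"
try:
    raw.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```

A parser that re-encodes the document preserves the characters, not the bytes, so anything smuggled in as raw bytes would not survive the trip.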
That being said, nobody seems to have a problem with using XML as a
serialization format for integers, floats or dates, so why should it be
one for binary data?
The trick is just to realize that, to borrow a notion which I find very
useful in W3C XML Schema, there is a decoupling between lexical and
value spaces, and then to define the best lexical space for the binary
content you want to serialize.
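A small Python sketch of that decoupling, under the assumption that Python's `int` and `Decimal` parsers are close enough stand-ins for integer and decimal datatypes: several distinct lexical forms map to a single value. The helper name `values` is hypothetical.

```python
from decimal import Decimal

def values(parser, lexical_forms):
    """Hypothetical helper: map each lexical form to its value and collect the set."""
    return {parser(s) for s in lexical_forms}

# Three different strings (lexical space), one integer (value space).
integer_values = values(int, ["42", "+42", "042"])

# Three different strings, one decimal value.
decimal_values = values(Decimal, ["1.5", "1.50", "01.5"])
```

Each set ends up with a single member, which is the whole point: what is written in the document and what it means are two separate things.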
For arbitrary binary data, hex or base64 seem to be obvious choices but
for data which is "almost text" with special "things" embedded, other
solutions can be found.
One of them is to serialize the "things" found in the text as elements
(which gives you mixed content); the other is to define a specific
lexical space for them (like "=00" or whatever). Which one you want to
use comes back to the debate over using structured values in elements or
attributes.
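Both approaches to a lexical space can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the element name `blob`, the sample payloads, and the `escape_bytes` / `unescape_bytes` helpers, which implement one possible "=XX" convention rather than any standard one.

```python
import base64
import xml.etree.ElementTree as ET

# --- Arbitrary binary data: base64 as the lexical space ---
payload = bytes([0x00, 0x01, 0xFF, 0x10, 0x80])  # sample binary value

elem = ET.Element("blob")  # element name made up for this sketch
elem.text = base64.b64encode(payload).decode("ascii")

# Serialize the same document in two encodings: the bytes on disk differ...
as_utf8 = ET.tostring(elem, encoding="utf-8")
as_utf16 = ET.tostring(elem, encoding="utf-16")

# ...but the value recovered from either serialization is the same.
from_utf8 = base64.b64decode(ET.fromstring(as_utf8).text)
from_utf16 = base64.b64decode(ET.fromstring(as_utf16).text)

# --- "Almost text": a hypothetical "=XX" lexical space ---
def escape_bytes(data: bytes) -> str:
    """Keep printable ASCII as-is; write '=' and everything else as =XX (hex)."""
    out = []
    for b in data:
        if 0x20 <= b <= 0x7E and b != 0x3D:  # 0x3D is '=' itself
            out.append(chr(b))
        else:
            out.append(f"={b:02X}")
    return "".join(out)

def unescape_bytes(text: str) -> bytes:
    """Inverse of escape_bytes: turn each =XX back into a single byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "=":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)

almost_text = b"name\x00value=1"  # sample data with one embedded NUL
escaped = escape_bytes(almost_text)
```

The base64 form is opaque but compact and safe for any payload; the "=XX" form stays readable when most of the data really is text.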
I think that it's important to realize that the cases where the lexical
and value spaces are identical are fairly uncommon (except in the
"document" world) and that for the vast majority of datatypes the two
spaces differ and an encoding step is needed.
BTW, when you think about it, this decoupling goes beyond the XML world...
In Europe, the Euro has already been there for a couple of years and
what will happen in 12 days is just a harmonization of the many lexical
spaces to cut the processing costs ;=) ...
Eric
--
See you in Paris for the Electronic Business Days 2002.
http://www.edifrance.org/ebd/index.htm
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
http://xsltunit.org http://4xt.org http://examplotron.org
------------------------------------------------------------------------