OASIS Mailing List Archives





   Lexical vs value spaces (re: Binary content and allowed characters in XM



Nicolas LEHUEN wrote:

> I don't think readability alone is a sufficient reason to forbid binary
> content from appearing in an XML document.

I agree: there is a much better and simpler reason to forbid binary 
content from XML documents ;=) ...

> What defines the set of allowed characters in XML content? Is it technical
> reasons, or readability reasons?

IMO, neither, but rather a fundamental design decision: an XML 
entity is Unicode text (possibly using another encoding) and not a 
stream of bytes.

This should be a sufficient reason to close the debate IMO!

The problem with including arbitrary binary content would not so much be 
the "control characters", but the fact that the physical value of this 
content read as bytes would change depending on the encoding used for 
the document (what if I save it as utf-16 while it has been created as 

We are using a layered model where XML is built on Unicode, and 
embedding raw bytes would short-circuit the lower level...
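
To make the encoding point concrete, here is a minimal Python sketch (standard library only; the sample string and variable names are just illustrative assumptions). The same Unicode text gives different bytes under different encodings, so a payload that depended on specific byte values would not survive a re-save:

```python
# The same Unicode text, serialized with two different encodings.
text = "<data>caf\u00e9</data>"

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

# The characters are identical after decoding...
assert as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le") == text

# ...but the physical bytes are not, so any "binary" payload that relied
# on specific byte values would change when the document is re-encoded.
print(as_utf8.hex())
print(as_utf16.hex())
bytes_differ = as_utf8 != as_utf16
```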

That being said, nobody seems to have a problem with using XML as a 
serialization format for integers, floats or dates, so why should it be 
one for binary data?

The trick is just to realize that, to borrow a notion which I find very 
useful in W3C XML Schema, the lexical and value spaces are decoupled, 
and then to define the best lexical space for the binary content 
you want to serialize.

For arbitrary binary data, hex or base64 seem to be obvious choices but 
for data which is "almost text" with special "things" embedded, other 
solutions can be found.
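
As a rough sketch of the hex and base64 options, using only the Python standard library (the element names "blob", "hex" and "base64" are made up for the example, not taken from any schema): the value is a byte string, and each lexical form is plain text that round-trips through an XML document.

```python
import base64
import binascii
import xml.etree.ElementTree as ET

payload = bytes([0x00, 0x01, 0xFF, 0x10, 0x80])  # value space: raw bytes

# Two possible lexical spaces for the same value.
hex_text = binascii.hexlify(payload).decode("ascii")
b64_text = base64.b64encode(payload).decode("ascii")

# Hypothetical element names, chosen only for illustration.
root = ET.Element("blob")
ET.SubElement(root, "hex").text = hex_text
ET.SubElement(root, "base64").text = b64_text
document = ET.tostring(root, encoding="unicode")

# Parsing the document and decoding either lexical form gives the bytes back.
parsed = ET.fromstring(document)
from_hex = binascii.unhexlify(parsed.findtext("hex"))
from_b64 = base64.b64decode(parsed.findtext("base64"))
```

Since both lexical forms are plain ASCII text, the document can be re-saved in any encoding without touching the value.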

One is to serialize the "things" found in the text as elements 
(which then gives you mixed content); the other is to define a specific 
lexical space for them (like "=00" or whatever). Which one you want to 
use comes back to the debate of using structured values in elements or 

I think that it's important to realize that the cases where the lexical 
and value spaces are identical are fairly uncommon (except in the 
"document" world) and that for the vast majority of datatypes some 
encoding needs to be performed because these spaces are different.
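
For "almost text" data, a specific lexical space such as the "=00" form mentioned above could look like the following Python sketch. The exact rule used here (printable ASCII kept as-is, every other byte and "=" itself written as "=XX") is my assumption, loosely modelled on quoted-printable; the function names are hypothetical.

```python
import re

def to_lexical(value: bytes) -> str:
    """Map bytes (value space) to text (lexical space).

    Assumed rule: printable ASCII is kept, everything else and the
    escape character "=" itself are written as "=XX" in hex.
    """
    out = []
    for b in value:
        if 0x20 <= b <= 0x7E and b != ord("="):
            out.append(chr(b))
        else:
            out.append(f"={b:02X}")
    return "".join(out)

def to_value(lexical: str) -> bytes:
    """Inverse mapping: turn each "=XX" back into the byte it stands for."""
    decoded = re.sub(
        r"=([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        lexical,
    )
    # Every character is now in the range 0..255, so latin-1 maps 1:1 to bytes.
    return decoded.encode("latin-1")

raw = b"name\x00value=1"
lexical = to_lexical(raw)
assert to_value(lexical) == raw
```

The lexical form contains only printable characters, so it can be used as ordinary element or attribute content; characters like "<" and "&" are left to the XML serializer's normal escaping.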

BTW, when you think about it, this decoupling goes beyond the XML world... 
In Europe, the Euro has already been there for a couple of years and 
what will happen in 12 days is just a harmonization of the many lexical 
spaces to cut the processing costs ;=) ...

See you in Paris for the Electronic Business Days 2002.
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
http://xsltunit.org      http://4xt.org           http://examplotron.org



Copyright 2001 XML.org. This site is hosted by OASIS