xml-dev - RE: [xml-dev] Lexical vs value spaces (re: Binary content and allowed ch

To: 'Eric van der Vlist' <vdv@dyomedea.com>, "'xml-dev@lists.xml.org'" <xml-dev@lists.xml.org>
Subject: RE: [xml-dev] Lexical vs value spaces (re: Binary content and allowed characters in XML)
From: Nicolas LEHUEN <nicolas.lehuen@ubicco.com>
Date: Thu, 20 Dec 2001 16:04:04 +0100

>IMO, none of them, but rather a fundamental design decision: a XML 
>entity is a Unicode text (eventually using another encoding) and not a 
>stream of bytes.
>
>This should be a sufficient reason to close the debate IMO!

OK, that's the good reason I was waiting for :). I was kind of playing
devil's advocate here, without knowing the proper answer myself :P.

>The problem with including arbitrary binary content would not so much be
>the "control characters", but the fact that the physical value of this 
>content read as bytes would change depending on the encoding used for 
>the document (what if I save it as utf-16 while it has been created as 
>utf-8).
>
>We are using a layered model where XML is built on Unicode and that 
>would be a short-circuit of the lower level...

Agreed. The parser would have no way to distinguish between text content and
binary content, so it would try to decode our binary content as encoded
Unicode characters, which would produce nonsense. I get it now.
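
To make that concrete, here is a small Python sketch (the four-byte payload
is made up for illustration): raw bytes pushed through a text decoder either
fail or turn into characters, and those characters come out as different
bytes once the document is saved in another encoding.

```python
# Hypothetical payload: four arbitrary bytes, not text in any encoding.
payload = bytes([0x89, 0x50, 0x3C, 0xFF])

# Interpreting the raw bytes as UTF-8 text fails outright...
try:
    payload.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# ...while ISO-8859-1 "succeeds", but yields characters (one of them '<').
as_latin1 = payload.decode("iso-8859-1")

# Saving those characters as UTF-16 changes the bytes on disk.
resaved = as_latin1.encode("utf-16-be")

print(decoded_ok, "<" in as_latin1, resaved != payload)
```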

>That being said, this doesn't seem to be a problem to use XML as a 
>serialization format for integers, float or dates, why should it be for
>binary data?

Well, some people aren't happy because they can't directly embed binary
content within an XML document. But alas, even if they could, they would
have to escape the byte sequence corresponding to '<' in the document
encoding, which is sometimes unknown at document creation time (especially
if you use a SAX or DOM API without handling the serialization yourself).

In XML, you just CANNOT embed anything WITHOUT taking care of escaping the
XML markup characters, which are '<', '&' and quotation marks, depending on
the current parser/serializer state.
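
As an illustration, Python's standard xml.sax.saxutils module makes the
state dependence visible: escape() is meant for element content and
quoteattr() for attribute values. The sample string and the <code> element
with its note attribute are made up for this sketch.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical technical text containing markup characters and quotes.
text = 'if a < b && c: print("ok")'

# Element content: '<' and '&' must be escaped (escape() also handles '>').
element_content = escape(text)

# Attribute value: quoteattr() additionally picks a quote style and
# escapes quotation marks if needed, then returns the quoted value.
attribute_value = quoteattr(text)

# Round trip through a parser to check that the original text comes back.
document = "<code note=%s>%s</code>" % (attribute_value, element_content)
parsed = ET.fromstring(document)
print(parsed.text == text, parsed.get("note") == text)
```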

That's a direct consequence of the XML format, which uses delimiter
characters. It's too bad those delimiters come from the "useful" set of
characters instead of dedicated control characters, which forces us to
escape even simple text (well, at least technical text with '<' inside).

When you're working with text in a Unicode-aware programming language,
escaping is easy, since you compare characters with '<'. If you were
embedding binary data, you would have to compare your data with the bytes
that encode '<', which are not always known at document building time (in
UTF-16 they would be 0x00 0x3C or 0x3C 0x00 depending on byte order, in
UTF-8 and ISO-8859-1 the single byte 0x3C).
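
A quick Python check of those byte values, as a sketch. It also shows why
scanning at the byte level is fragile: the character U+3C00 (picked only
because of its code point) contains the byte 0x3C in UTF-16 without being
a '<' at all.

```python
encodings = ["utf-8", "iso-8859-1", "utf-16-be", "utf-16-le"]

# Bytes produced for the single character '<' in each encoding.
lt_bytes = {name: "<".encode(name) for name in encodings}
for name, raw in lt_bytes.items():
    print(name, raw.hex())

# A naive byte-level search for 0x3C gives a false positive on U+3C00,
# whose big-endian UTF-16 form is the two bytes 0x3C 0x00.
other_char = "\u3c00"
false_hit = b"\x3c" in other_char.encode("utf-16-be")
print(false_hit)
```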

So, since you're forced to encode your binary content into *characters* (not
bytes) that will then be encoded into bytes according to the character
encoding, why not use Base64? Note that there are other solutions which may
be more economical [1].
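
A minimal sketch of the Base64 route with Python's standard library (the
<blob> element name is invented for the example). The payload becomes plain
characters, so the same document can be serialized as UTF-8 or UTF-16 and
the original bytes still come back unchanged.

```python
import base64
import xml.etree.ElementTree as ET

# Every possible byte value, including 0x00 and 0x3C ('<' in ASCII).
payload = bytes(range(256))

elem = ET.Element("blob")  # hypothetical element name
elem.text = base64.b64encode(payload).decode("ascii")

# Serialize the same tree in two encodings and decode the payload again.
documents = {}
recovered = {}
for encoding in ("utf-8", "utf-16"):
    documents[encoding] = ET.tostring(elem, encoding=encoding)
    parsed = ET.fromstring(documents[encoding])
    recovered[encoding] = base64.b64decode(parsed.text)

print(all(data == payload for data in recovered.values()))
```

The serialized documents differ byte for byte, but the decoded payload is
the same in both cases.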

>The trick is just to realize that, to take a notion which I find very 
>useful in W3C XML Schema, there is a decoupling between lexical and 
>value spaces and to define the best lexical space for the binary content 
>you want to serialize.
>
>For arbitrary binary data, hex or base64 seem to be obvious choices but 
>for data which is "almost text" with special "things" embedded, other 
>solutions can be found.
>
>One of them is to serialize the "things" found in the text as elements 
>(and you have then a mixed content), the other is to define a specific 
>lexical space for them (like "=00" or whatever). Which one you want to 
>use comes back to the debate of using structured values in elements or 
>attributes.
>
>I think that it's important to realize that the cases where the lexical 
>and value spaces are identical are fairly uncommon (except in the 
>"document" world) and that for a vast majority of datatypes a coding 
>needs to be performed and these spaces are different.
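
To illustrate the decoupling described above, a small Python sketch: several
lexical forms (character strings) map to one and the same value. The sample
bytes are made up, and the quoted-printable codec is only my stand-in for
the "=00" style of lexical space, not something the quoted message names.

```python
import base64
import binascii
import quopri

# One integer value, three lexical representations.
int_lexicals = ["42", "042", "+42"]
int_values = {int(s) for s in int_lexicals}

# One binary value, three lexical spaces.
value = b"a\x00b<\xff"
hex_lexical = binascii.hexlify(value).decode("ascii")
b64_lexical = base64.b64encode(value).decode("ascii")
qp_lexical = quopri.encodestring(value).decode("ascii")  # "=00" style

# Decoding each lexical form gives back the same value.
decoded = [
    binascii.unhexlify(hex_lexical),
    base64.b64decode(b64_lexical),
    quopri.decodestring(qp_lexical.encode("ascii")),
]
print(int_values, hex_lexical, b64_lexical, qp_lexical)
```

Note that the quoted-printable form still contains a literal '<', so it
would need the usual XML escaping on top, whereas hex and Base64 never
produce markup characters.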

So why do people keep insisting that their XML content be readable with
vi? Why does it matter so much to be able to read XML documents with
inappropriate tools, when we could easily have true XML viewers?

I could invent a stupid Unicode encoding that would make any XML document
unreadable in vi (for example, U+0123 would be encoded as 0x32 0x10), yet
perfectly correct provided that the parser has the corresponding decoder.
But nobody would want to use it, because they would not be able to read it
in the lexical space... We humans don't care about the lexical space, it's
the value space that has some meaning!

Regards,
Nicolas

[1] http://www.javaworld.com/javaworld/javatips/jw-javatip117.html
