I don't think readability alone is a sufficient reason to forbid binary
content from appearing in an XML document.
What defines the set of allowed characters in XML content? Is it technical
reasons, or readability reasons?
Technical reasons are related to the need for simplicity in implementing
XML parsers. XML parsers should be able to follow a very simple set of
states and rules, being implementable as finite-state automata with very
few states.
This means, for example, that you have to forbid certain delimiter
characters from appearing in names, attribute values or text nodes, the best
example being '<'. We could remove this limitation with various tricks, but
this would complicate the parser and/or the serializer.
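The usual trick is entity escaping, which the standard library already provides. A minimal Python sketch:

```python
from xml.sax.saxutils import escape, unescape

raw = "if a < b then a&b"

# '<' and '&' would confuse the parser's state machine inside a text
# node, so they are replaced by entity references before serialization:
escaped = escape(raw)
print(escaped)                    # if a &lt; b then a&amp;b

# The receiving side reverses the substitution after parsing:
assert unescape(escaped) == raw
```

This keeps the parser simple, but it is exactly the kind of extra serializer/parser step the paragraph above alludes to.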
I think technical reasons alone forbid only a very small set of characters
from appearing in content: it may be limited to '<', whitespace and
quotation marks, depending on the state of the parser (e.g. in text nodes,
whitespace and quotation marks are allowed).
Moreover, the fact that some characters have to be forbidden is simply due
to XML parsing relying on delimiter characters; there are other ways of
encoding XML-like labeled trees that do not use delimiter characters. For
example, we could encode the length of each content string instead of
marking its end with a state-dependent stop character. I'm not sure this
would complicate the parser. That way, strings could be composed of any
arbitrary byte sequence, which would mean we could encode binary data as
well as text in XML, at the cost of readability (no one wants to read all
those bytes encoding the lengths of strings).
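A minimal Python sketch of such a length-prefixed encoding. The node layout here (a `(name, text, children)` tuple, each string preceded by a 4-byte big-endian length) is my own illustrative choice, not an existing format:

```python
import struct

def serialize(node) -> bytes:
    """node = (name: bytes, text: bytes, children: list of nodes).
    Every string is preceded by its length, so no delimiter characters
    are needed and the payload may contain arbitrary bytes."""
    name, text, children = node
    out = struct.pack(">I", len(name)) + name
    out += struct.pack(">I", len(text)) + text
    out += struct.pack(">I", len(children))
    for child in children:
        out += serialize(child)
    return out

def parse(data: bytes, pos: int = 0):
    """Inverse of serialize; returns (node, position after the node)."""
    def take_length():
        nonlocal pos
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4
        return n
    n = take_length(); name = data[pos:pos + n]; pos += n
    n = take_length(); text = data[pos:pos + n]; pos += n
    children = []
    for _ in range(take_length()):
        child, pos = parse(data, pos)
        children.append(child)
    return (name, text, children), pos

# A text node may now hold raw bytes -- even NUL or '<':
tree = (b"doc", b"", [(b"blob", b"\x00\x3c\xff", [])])
assert parse(serialize(tree))[0] == tree
```

Note the parser stays a simple recursive loop with no escaping states at all; what is lost is exactly the readability discussed next.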
The problem is, readability is a subjective concept. There is no character
encoding that both contains all the characters required for a given
language and is readable on every platform. If you define readability as
"I can read it with vi under my Unix variant", you'll have a hard time
trying to find such an encoding.
Forget about English-centric character encodings. Lots of people have to
encode content with weird accented characters (I'm French, so I know a bit
about that), and many more people don't have a 26-character alphabet, but
bunches of ideograms. Chances are that an XML file containing Kanji
characters is not readable in vi running on whatever Unix variant.
How do you define readability in this context? How do you justify the fact
that you cannot directly embed binary data within an XML document, whereas
Kanji text would look like binary data to me, on my occidental computer?
I don't think the readability of the serialized form is so important. What
matters to me is the fact that I can correctly exchange labeled trees while
keeping the serialization/parsing process simple and platform-independent.
XML answers this need, as long as labels and values are 'text', whatever
that means for the W3C (once again, a UTF-8 string is no longer text for me
as soon as I've got French accented characters in my strings). But for
binary data, I have to use tricks. I don't want to.
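The trick in question is usually base64. A quick Python illustration (the `<payload>` element name is made up for the example):

```python
import base64

blob = bytes(range(256))  # arbitrary binary data, including '<' and NUL

# Encode before embedding, at roughly a 33% size cost:
encoded = base64.b64encode(blob).decode("ascii")
xml = '<payload encoding="base64">%s</payload>' % encoded

# ...and every consumer must know to decode it on the other side:
assert base64.b64decode(encoded) == blob
```

It works, but the encoding step, the size overhead, and the out-of-band agreement on what the element contains are exactly the tricks the post objects to.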