The W3C XML Core Working Group (I am not a member, but John Cowan is)
has apparently been discussing these kinds of issues as part of their XML 1.1 escapade.
Some interesting issues that arise out of that include:
* The RFC on MIME types talks about "textual" data rather than the text/binary distinction.
So control characters in an ASCII file may be "text" to some
but they are not "textual". So the debate is not between XML as text or binary, but whether
XML should be text-with-controls or be textual, in the sense of being legible in a capable vanilla
editor (or being character-by-character speakable in a speech synthesizer for
the locale of that document, I guess.)
* Unicode has changed since 3.0 to allocate by default the ISO 6429 control codes.
In Unicode 2.x which XML 1.0 was based on, the control range was not allocated
to particular points. If XML 1.1 mandates that NEL is U+0085, then it adopts the
ISO 6429 characters: however, some ISO 6429 controls do not have corresponding
Unicode mappings: they are reserved only for lower-layer use, presumably in
legacy systems, such as PAD, 0x80, 0x81, 0x82. Should these characters be
left free and privately defined, or banned?
* For the robustness reasons I give in that note, it is highly desirable that XML
2.0 ban as many C1 (0x80-0x9F) characters as possible. However, the XML Core WG
seem to be backing themselves into a corner: they say they cannot deprecate or shun the
C1 controls because of XML 1.0 compatibility, but also that they want to close the
character repertoire issue once and for all. This in effect closes the door on improvements
to repertoire and robustness from XML 2.0: they can only allow supersets. (Of course,
every issue can be revisited, so I think they are fooling themselves if they think that
repertoire issues can ever go away.)
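
As a rough illustration of the robustness concern, here is a minimal Python sketch. It assumes that the mislabelling in question is the common case of Windows-1252 data declared as ISO-8859-1; the helper name c1_controls_after_decoding is made up for this example.

```python
def c1_controls_after_decoding(raw, declared_encoding):
    """List (offset, code point) for every C1 control (0x80-0x9F) that
    appears once the bytes are decoded with the declared encoding."""
    text = raw.decode(declared_encoding)
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F
    ]


# The Euro sign is byte 0x80 in Windows-1252.
raw = "price: 10 \u20ac".encode("cp1252")

# Mislabelled as ISO-8859-1, the Euro sign silently becomes a C1 control.
print(c1_controls_after_decoding(raw, "iso-8859-1"))

# Correctly labelled, no C1 controls appear.
print(c1_controls_after_decoding(raw, "cp1252"))
```

If C1 controls are banned as literal characters, the first case is detectable as an error instead of passing through as valid but wrong data.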
From: "Michael Kay" <email@example.com>
> I don't want to dumb XML down. But we do sometimes need to store data (e.g.
> WebDAV property values) which can potentially contain characters that are
> not permitted in XML. In fact, it's very unlikely that a WebDAV property
> value will contain such a character, but the software still needs to allow
> for the possibility.
XML has never been about guaranteed interoperability. Rather, it means that
if you pick a conservative character encoding, and conservative name characters,
and conservative data characters, and only use reliable URIs for system identifiers
and links etc, and send standalone documents, and normalize your document
correctly before you send it, you can expect your data to go through.
Around this core of expectable interoperability is a cloud of regional interoperability,
where, say, people in Taiwan probably only use XML systems that support Big5
and people in Uganda probably are happy to use systems whether or not they
support Big5.
It might be that people who want to exchange UTF-* with control characters
are better off treated as if they are a region. So an XML document with
<?xml version="1.1" encoding="utf-8"?>
would barf if it found a C1 control (for the robustness/mislabelling reasons)
but accept them if it found either
<?xml version="1.1" encoding="utf-8-with-controls"?>
or
<?xml version="1.1" encoding="utf-8" controls="allow" ?>
That has the advantage of moving the issue into being one of labelling rather
than invisible characters, with a safe default. And it would save the XML Core WG
from accusations of favouring Westerners over Asians since non-ASCII users do face
robustness issues that ASCII-repertoire users do not. (And, actually, because of the
Euro issue, this is now more like [English, Bahasa]-users versus non-[English, Bahasa]-users.)
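
A minimal Python sketch of how a processor might apply this labelling rule. Both the "utf-8-with-controls" label and the controls="allow" pseudo-attribute are only the hypothetical spellings floated above, not anything registered or specified, and the function names are invented for the example.

```python
import re

# C1 control range discussed in this post (0x80-0x9F).
C1_CONTROLS = re.compile("[\u0080-\u009f]")

# Hypothetical label from this post; not a registered encoding name.
PERMISSIVE_LABELS = {"utf-8-with-controls"}


def controls_allowed(encoding, controls=None):
    """True if the (hypothetical) declaration opts in to C1 controls."""
    return encoding.lower() in PERMISSIVE_LABELS or controls == "allow"


def check_document_text(text, encoding, controls=None):
    """Raise ValueError on a literal C1 control unless the label opts in."""
    if controls_allowed(encoding, controls):
        return
    match = C1_CONTROLS.search(text)
    if match:
        raise ValueError(
            "C1 control U+%04X at offset %d; declaration does not allow controls"
            % (ord(match.group()), match.start())
        )
```

The default (plain "utf-8", no opt-in) is the strict one, so a document only carries C1 controls when its label says so.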
> I don't personally see any good reason why C0 (and C1) characters shouldn't
> be permitted XML characters, with the restriction that they must be written
> as character references.
There is no reason from SGML compatibility. It would be merely an additional requirement
for XML that the particular characters are only referenced not used directly,
and something that serializers should attend to.
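
A minimal Python sketch of what that serializer step could look like, assuming the proposal covers every C0 control except tab, LF and CR, plus all of C1. The function name is invented, U+0000 is not special-cased, and markup characters such as & and < are left to the rest of the serializer.

```python
def escape_controls(text):
    """Write C0/C1 controls as numeric character references,
    leaving tab, LF and CR as literal characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        is_c0 = cp < 0x20 and ch not in "\t\n\r"
        is_c1 = 0x80 <= cp <= 0x9F
        if is_c0 or is_c1:
            out.append("&#x%X;" % cp)
        else:
            out.append(ch)
    return "".join(out)
```

For example, escape_controls("a\x07b") yields the seven-character string a&#x7;b, so the control never appears literally in the serialized document.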
Another alternative is to define built-in named character references (i.e. like &lt;)
based on the actual control characters, so people can type
<p>blah&BEL;blah&EOT;blah</p>. Of course, it is likely that people who want
to send information using C1 controls are not actually using the ISO 6429 characters:
they are using the characters for some private, proprietary or nefarious purpose
rather than the public, resolved, robust, safe data interchange for which XML was created.
So named character references would not really answer their needs.
One point about XML being textual or not is whether you need an API to
access/create/read the data. If you need an API, then the question arises:
who controls the API?