xml-dev - Re: [xml-dev] MSXML DOM Special Chars Less Than 32


To: <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] MSXML DOM Special Chars Less Than 32
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Sun, 24 Mar 2002 17:54:05 +1100
References: <001701c1d18d$36247f40$655169d5@pcukmka>

The W3C XML Core Working Group (I am not a member, but John Cowan is) 
has apparently been discussing these kinds of issues as part of their XML 1.1 escapade.

Some interesting issues that arise out of that include:

   * The RFC on MIME types talks about "textual" data rather than the text/binary distinction.
      So control characters in an ASCII file may be "text" to some
      but they are not "textual".[1] So the debate is not between XML as text or binary, but whether 
      XML should be text-with-controls or be textual, in the sense of being legible in a capable vanilla
      editor (or being character-by-character speakable in a speech synthesizer for
      the locale of that document, I guess.)

   * Unicode has changed since 3.0 to allocate by default the ISO 6429 control codes.
      In Unicode 2.x, on which XML 1.0 was based, the control range was not allocated
      to particular points.  If XML 1.1 mandates that NEL is U+0085, then it adopts the
      ISO 6429 characters: however, some ISO 6429 controls do not have corresponding
      Unicode mappings: they are reserved only for lower-layer use, presumably in 
      legacy systems, such as PAD, 0x80, 0x81, 0x82.   Should these characters be
      left free and privately defined, or banned?

   * For the robustness reasons I give in that note, it is highly desirable that XML
      2.0 ban as many C1 (0x80-0x9F) characters as possible.  However, the XML Core WG
      seem to be backing themselves into a corner: they say they cannot deprecate or shun the
      C1 controls because of XML 1.0 compatibility but also that they want to close the 
      character repertoire issue once and for all--this in effect closes the door on improvements
      in repertoire and robustness from XML 2.0: they can only allow supersets.  (Of course, 
      every issue can be revisited, so I think they are fooling themselves if they think that 
      repertoire issues can ever go away.)

From: "Michael Kay" <michael.h.kay@ntlworld.com>

> I don't want to dumb XML down. But we do sometimes need to store data (e.g.
> WebDAV property values) which can potentially contain characters that are
> not permitted in XML. In fact, it's very unlikely that a WebDAV property
> value will contain such a character, but the software still needs to allow
> for the possibility.

XML has never been about guaranteed interoperability.  Rather, it means that
if you pick a conservative character encoding, and conservative name characters,
and conservative data characters, and only use reliable URIs for system identifiers
and links etc, and send standalone documents, and normalize your document
correctly before you send it, you can expect your data to go through.

Around this core of expectable interoperability is a cloud of regional interoperability,
where, say, people in Taiwan probably only use XML systems that support Big5
and people in Uganda probably are happy to use systems whether or not they
support Big5. 

It might be that people who want to exchange UTF-* with control characters
are better off treated as if they are a region.  So an XML document with
 <?xml version="1.1" encoding="utf-8"?>
would barf if it found a C1 control (for the robustness/mislabelling reasons)
but accept them if it found
 <?xml version="1.1" encoding="utf-8-with-controls"?>
or
 <?xml version="1.1" encoding="utf-8"  controls="allow" ?>

That has the advantage of moving the issue into being one of labelling rather
than invisible characters, with a safe default. And it would save the XML Core WG 
from accusations of favouring Westerners over Asians since non-ASCII users do face
robustness issues that ASCII-repertoire users do not. (And, actually, because of the 
Euro issue, this is now more like [English, Bahasa]-users versus non-[English, Bahasa]
users.)
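
To make the labelling idea concrete, here is a rough Python sketch of the check
a processor might do. It is only an illustration: neither the controls="allow"
pseudo-attribute nor the "utf-8-with-controls" encoding name exists in any XML
specification, and the function names are made up for this example. For
simplicity it treats every C1 character, NEL included, as a control; a real
proposal might well exempt NEL.

```python
import re

# C1 control range, U+0080 to U+009F (this sketch does not exempt NEL).
C1_CONTROLS = re.compile(r"[\u0080-\u009f]")

# The XML declaration and its pseudo-attributes (single or double quoted).
DECL = re.compile(r"^<\?xml\s+([^?]*)\?>")
PSEUDO_ATTR = re.compile(r"""(\w+)\s*=\s*(["'])(.*?)\2""")


def controls_allowed(document: str) -> bool:
    """True if the declaration carries one of the hypothetical labels."""
    match = DECL.match(document)
    if match is None:
        return False  # no declaration: fall back to the safe default
    attrs = {name: value for name, _, value in PSEUDO_ATTR.findall(match.group(1))}
    return (
        attrs.get("controls") == "allow"
        or attrs.get("encoding", "").lower() == "utf-8-with-controls"
    )


def check_document(document: str) -> None:
    """Raise ValueError on a C1 control unless the document is labelled."""
    if controls_allowed(document):
        return
    found = C1_CONTROLS.search(document)
    if found is not None:
        raise ValueError(
            f"C1 control U+{ord(found.group()):04X} at offset {found.start()} "
            "in a document not labelled as allowing controls"
        )
```

The point of the design is that an unlabelled (or mislabelled) document fails
loudly at the first C1 character, while a sender who really wants controls
has to say so in the declaration.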

> I don't personally see any good reason why C0 (and C1) characters shouldn't
> be permitted XML characters, with the restriction that they must be written
> as character references.

There is no reason from SGML compatibility. It would merely be an additional requirement
for XML that these particular characters are only referenced, not used directly, 
and something that serializers should attend to. 
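
As a sketch of what "attending to it" could look like in a serializer, the
Python function below writes control characters in text content as numeric
character references and leaves TAB, LF and CR literal. The function name and
the choice of hexadecimal references are my own for this example, and
rejecting NUL outright is an assumption of the sketch, not something settled
in this thread.

```python
import re

# C0 controls other than TAB, LF and CR, plus DEL and the C1 range.
CONTROLS = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")
MARKUP = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def escape_text(text: str) -> str:
    """Escape markup characters, then write controls as character references."""
    if "\x00" in text:
        # Assumption of this sketch: NUL is never representable.
        raise ValueError("NUL cannot be serialized")
    # Markup first, so the '&' of the references added below is not re-escaped.
    text = re.sub(r"[&<>]", lambda m: MARKUP[m.group()], text)
    return CONTROLS.sub(lambda m: f"&#x{ord(m.group()):X};", text)
```

For example, escape_text("blah\x07blah") gives "blah&#x7;blah", so the control
character never appears as a raw, invisible byte in the serialized document.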

Another alternative is to define built-in named character references (i.e. like &lt;)
based on the actual control characters, so people can type
 <p>blah&BEL;blah&EOT;blah</p>
Of course, it is likely that people who want to send information using C1 controls
are not actually using the ISO 6429 characters: they are using the characters for
some private, proprietary or nefarious purpose rather than the public, resolved,
robust, safe data interchange for which XML was created.  
So named character references would not really answer their needs.


Cheers
Rick Jelliffe
(Writing personally)

[1] One point about XML being textual or not is whether you need an API to 
access/create/read the data or not.  If you need an API, then the issue arises:
who controls the API?

Follow-Ups:
- Re: [xml-dev] MSXML DOM Special Chars Less Than 32
  - From: John Cowan <jcowan@reutershealth.com>

References:
- RE: [xml-dev] MSXML DOM Special Chars Less Than 32
  - From: "Michael Kay" <michael.h.kay@ntlworld.com>
