
Re: Well-formed Blueberry

From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
To: xml-dev@lists.xml.org, www-xml-blueberry-comments@w3.org
Date: Sun, 15 Jul 2001 11:28:08 -0400

Some more thoughts on requirements for well-formed Blueberry documents:

If the need to stream documents makes it too problematic to require that only documents that actually use Blueberry name characters may carry a Blueberry declaration, then I propose a weaker alternative:

Only documents whose encoding declaration explicitly names a character set that can include Blueberry characters are allowed to have a Blueberry declaration. For example,

<?xml version="1.1" encoding="ISO-8859-1"?>

would be malformed. However, these would be well-formed:

<?xml version="1.1" encoding="UTF-8"?>
<?xml version="1.1" encoding="UTF-16"?>
<?xml version="1.1" encoding="UCS-4"?>

(I'm just using version="1.1" here to make my point. The details are not affected by what the Blueberry declaration eventually looks like.)

I further propose that the encoding declaration must be explicit. That is, this is malformed even though the default character set is UTF-8:

<?xml version="1.1"?>

My logic is that many authors just write this when what they really mean is encoding="US-ASCII". I do not think requiring Blueberry documents to explicitly specify UTF-8 is an onerous burden. Note that this does not change the default character set for <?xml version="1.1"?>, which would still be UTF-8.
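
As a rough illustration of the rule proposed above, here is a minimal Python sketch. It assumes that version="1.1" stands in for the Blueberry declaration (as in the examples) and that the set of acceptable encodings is just the three names used in this message; both the set and the function names are hypothetical, not part of any specification.

```python
import re
from typing import Optional

# Assumption: an illustrative whitelist taken from the examples in this
# message, not an authoritative list of Blueberry-capable encodings.
BLUEBERRY_CAPABLE = {"UTF-8", "UTF-16", "UCS-4"}


def _pseudo_attr(decl: str, name: str) -> Optional[str]:
    """Return the value of a pseudo-attribute in an XML declaration, or None."""
    match = re.search(rf'\b{name}\s*=\s*(["\'])(.*?)\1', decl)
    return match.group(2) if match else None


def blueberry_declaration_allowed(decl: str) -> bool:
    """Apply the proposed rule to a single XML declaration string."""
    if _pseudo_attr(decl, "version") != "1.1":
        # The rule only constrains documents that carry a Blueberry declaration.
        return True
    encoding = _pseudo_attr(decl, "encoding")
    if encoding is None:
        # The encoding must be explicit, even though the default is UTF-8.
        return False
    # Encoding names are compared case-insensitively.
    return encoding.upper() in BLUEBERRY_CAPABLE
```

Under these assumptions, the ISO-8859-1 declaration and the declaration with no encoding are both rejected, while the UTF-8, UTF-16, and UCS-4 declarations are accepted.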

There are not that many encodings that can handle the Blueberry characters: basically just several variants of Unicode, one Japanese character set, and possibly a couple of Chinese character sets. Most of the scripts at issue here (Amharic, Khmer, Burmese, etc.) never had a standard encoding prior to Unicode. Indeed, that is exactly the reason it took until Unicode 3.0 to decide how to encode them. It was not possible to simply transpose an existing national character set. There have been numerous proposals for alternative encodings of Unicode lately, but all of them have been shot down with extreme hostility by the Unicode Consortium. Thus I do not think it would be a huge problem to enumerate all the encodings anybody is likely to want for Blueberry characters. Certainly, any new encodings that do arise in the future should be round-trippable to standard Unicode encodings.
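
One hedged way to test whether a given encoding belongs in such an enumeration is to round-trip a few characters from the scripts mentioned above. The sketch below uses Python's built-in codecs; the sample characters (one each from the Ethiopic, Khmer, and Myanmar blocks) and the function name are illustrative choices only, and passing this check is a necessary condition, not proof that an encoding covers every Blueberry character.

```python
# Assumption: three sample letters chosen for illustration,
# ETHIOPIC SYLLABLE HA, KHMER LETTER KA, MYANMAR LETTER KA.
SAMPLE = "\u1200\u1780\u1000"


def can_round_trip(encoding: str, text: str = SAMPLE) -> bool:
    """Return True if text survives encoding and decoding unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # Raised when the encoding has no representation for a character.
        return False


if __name__ == "__main__":
    for name in ("utf-8", "utf-16", "utf-32", "iso-8859-1", "ascii"):
        print(name, can_round_trip(name))
```

An unknown encoding name raises LookupError rather than returning False, which keeps typos from being mistaken for encodings that lack coverage.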
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+ 
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      | 
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+
