
XML Blueberry

From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
To: xml-dev@lists.xml.org
Date: Thu, 21 Jun 2001 09:34:36 -0400

The W3C XML Core Working Group has posted the first public draft of XML
Blueberry Requirements:

http://www.w3.org/TR/2001/WD-xml-blueberry-req-20010620

This is a proposal for a new BACKWARDS INCOMPATIBLE version of XML.
The specific goals are to address some shortcomings of the XML 1.0
character model relative to Unicode 3.1 and to throw a sop to IBM.

The concern with respect to IBM is that one of the world's largest
corporations, with thousands of patents, legions of programmers,
billions of dollars in revenue, and resources pouring out of every
orifice is somehow unable to handle documents where lines end with
carriage returns and line feeds, as they do on every non-IBM system on
the planet. The only reason there's a problem here at all is that IBM
tried to go it alone as a monopoly and set standards by fiat for years
rather than working with the rest of the industry. Consequently their
mainframe character sets don't really interoperate well with everybody
else's character sets. In XML this arises as a problem with line endings
when someone edits an XML document with an IBM mainframe text editor.
IBM mostly grew out of their anti-competitive monopolistic tendencies
over the last thirty years (with a large dose of assistance from the
U.S. government). However, there are still some legacy issues relating
to their attempt to dictate standards to the rest of the industry, and
this is one of them. Now rather than fixing their own broken mainframe
text editing software, they want everyone else on the planet to change
their software so IBM doesn't have to. (If this reminds anybody of the
current mess with Oracle and UTF-8, you're not alone.) This proposal was
laughed out of the W3C a few months ago when IBM made it, or at least it
seemed to be. However, it has now risen from the dead as part of XML
Blueberry. It doesn't make any more sense now than it did then, and it
still deserves to be laughed off the table with whooping cries of
derision.
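
To make the line-ending question concrete, here is a minimal Python
sketch. It assumes the mainframe line-end character at issue is NEL
(U+0085), which this message does not name, and it uses the expat-based
parser in Python's standard library as a stand-in for a typical XML 1.0
parser. The helper name parsed_text is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parsed_text(raw: str) -> str:
    """Wrap raw text in a <doc> element and return what the parser reports."""
    return ET.fromstring(f"<doc>{raw}</doc>").text


# CR LF and a lone CR are the line endings XML 1.0 normalizes to LF.
crlf = parsed_text("one\r\ntwo")
cr = parsed_text("one\rtwo")

# Assumption: NEL (U+0085) is the mainframe line end under discussion.
# XML 1.0 treats it as an ordinary character, not as a line ending.
nel = parsed_text("one\u0085two")

print(repr(crlf))
print(repr(cr))
print(repr(nel))
```

Under those assumptions, CR LF and a lone CR come back as a single line
feed, while NEL passes through unchanged, which is the behavior the
proposal would ask existing parsers to change.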

The second proposal for breaking backwards compatibility with existing
parsers is much more serious, and requires a more thoughtful response.
Starting in Unicode 3.0, a number of new characters have been added,
both for previously unencoded scripts such as Amharic and Cherokee and
for existing scripts that were incomplete, such as Chinese. The concern
is that since XML 1.0 is based on Unicode 2.0,
"fully native-language XML markup is not possible in at least the
following languages: Amharic, Burmese, Canadian aboriginal languages,
Cantonese (Bopomofo script), Cherokee, Dhivehi, Khmer, Mongolian
(traditional script), Oromo, Syriac, Tigre, Yi. In addition, Chinese,
Japanese, Korean (Hangul script), and Vietnamese can make use of only a
limited subset of their complete character repertoires."

If this were true, it would be a very serious criticism of XML 1.0.
Fortunately, however, the claim is not nearly as dire as the proposal
makes out. Indeed the proposal substantially overstates the need for any
changes. The XML 1.0 BNF productions do not allow these newly defined
characters to be used in element, attribute, and entity names. However,
they can be used in the text of element content and attribute values.
This means that XML is fully adequate for literature and data in
Amharic, Burmese, Canadian aboriginal languages, Cantonese, Cherokee,
Dhivehi, Khmer, Mongolian, Oromo, Syriac, Tigre, Yi, Mandarin, Japanese,
Korean, and Vietnamese. Only the markup, that is, the tags, would have
to be written in another script. Given that there aren't even localized
operating systems in most of these languages, and that today's software
effectively requires users to have a solid knowledge of at least the
ASCII characters, I don't think the need to write markup (as opposed to
text) in Cherokee justifies breaking backwards compatibility.
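
The distinction between text and markup can be checked with a short
Python sketch. It uses the expat-based parser in Python's standard
library; the sample word and the element and attribute names are made up
for illustration. Whether the second document is rejected depends on the
parser: one that enforces the original XML 1.0 name productions should
refuse it, while one that follows later, relaxed name rules may accept
it, so the sketch only reports that outcome.

```python
import xml.etree.ElementTree as ET

# Three Cherokee syllables (the Cherokee block was added in Unicode 3.0).
CHEROKEE = "\u13e3\u13b3\u13a9"

# Case 1: Cherokee in element content and an attribute value, ASCII markup.
# The element and attribute names here are hypothetical.
doc = f'<word lang="chr" form="{CHEROKEE}">{CHEROKEE}</word>'
root = ET.fromstring(doc)
text_ok = root.text == CHEROKEE and root.get("form") == CHEROKEE

# Case 2: Cherokee used as the element name itself.
# The result depends on which name rules the parser enforces.
try:
    ET.fromstring(f"<{CHEROKEE}>text</{CHEROKEE}>")
    name_ok = True
except ET.ParseError:
    name_ok = False

print("Cherokee text accepted:", text_ok)
print("Cherokee element name accepted:", name_ok)
```

The first case is the one that matters for literature and data: the
characters survive a round trip through the parser even though the tags
are written in ASCII.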

But wait! It's not even that bad. Several of the languages listed are
total red herrings. You most certainly can write markup in Cantonese,
Japanese, Korean, Mandarin, and Vietnamese today. The new characters
Unicode has added to these scripts are very obscure. In fact, experts
often disagree over whether some of them exist at all, or are merely
typographical variations of existing characters. Since the 1700s
Vietnamese has been written in a Latin-based alphabet that is fully
available in XML and that can write any Vietnamese word. Vietnamese only
uses the Han ideographs for classical documents and occasional signage
or decoration, and it seems very unlikely that a Vietnamese speaker
would write their markup using Han ideographs. Japanese has not one but
two phonetic syllabaries, hiragana and katakana, that can write any
Japanese word if the right Han ideograph is not encoded. Chinese
speakers can use either
Latin characters or the native Bopomofo phonetic system for the very
rare cases where a character they need is not encoded. The fact is most
native speakers of Chinese, Japanese, Korean and Vietnamese do not
recognize the vast majority of these new characters, and the need for
them in markup (again, as opposed to text) is non-existent.

There are a few good points in this proposal. I'm sure there's an
occasional need for writing markup in Amharic, Burmese, Khmer,
Mongolian, Yi, and a few of the other languages the proposal lists. But
I don't believe there's enough of a need to justify breaking
compatibility with existing XML parsers, software, and systems. The XML
Blueberry Requirements vastly overstate the case by ignoring the
difference between markup and text in XML documents. I'd be willing to
break backwards compatibility to allow text in these languages if we had
to, but we don't. Text is already adequately handled by XML 1.0. All
we're arguing about now are the tags, and that's just not a strong
enough reason to break backwards compatibility.
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|                  The XML Bible (IDG Books, 1999)                   |
|              http://metalab.unc.edu/xml/books/bible/               |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/     |
+----------------------------------+---------------------------------+
