xml-dev - Some comments on the 1.1 draft


To: <xml-dev@lists.xml.org>
Subject: Some comments on the 1.1 draft
From: Tim Bray <tbray@textuality.com>
Date: Fri, 14 Dec 2001 17:02:41 -0800

I sent these to the public blueberry-comments address, but 
thought some of them might usefully be discussed here.  If
someone wants to start an argument about one or more of 
these, please pull it out and give it a separate subject
line.
========================================================
1. The principle of decoupling the XML spec from successive
revisions of Unicode is the only sensible way forward.

2. If no consensus can be built around the details of this
set of changes, it would be acceptable to declare defeat and
go on with XML 1.0 2nd ed as-is.  This would be a regrettable
outcome but not fatal at a deep level.

3. Issue 18: The costs of allowing #x1-#x1F appear to me to
exceed the benefits.  Among other things, many of these 
ASCII control chars, despite being several decades old, have
little consensus concerning their semantics, e.g. ETX and EOT
(#x3 and #x4).  I think from the XML point of view these things
are actively pernicious; specifically the notion that semantics 
are embedded in characters rather than being expressed by markup.  
The case of "textual content that may contain such characters 
(but typically does not)" is pretty unconvincing.  In *many* 
cases the occurrence of these characters is evidence of an error.
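To illustrate the "evidence of an error" point, here is a minimal
sketch of a pre-parse check that reports C0 control characters other
than tab, line feed and carriage return (the only ones XML 1.0
allows).  The function name and the allowed set's name are
hypothetical, chosen only for this example.

```python
# C0 controls that XML 1.0 permits in content: TAB, LF, CR.
ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def find_c0_controls(text):
    """Return (offset, code point) pairs for disallowed C0 controls.

    Hypothetical helper: flags every character below #x20 that is not
    tab, line feed or carriage return, including #x0.
    """
    return [
        (offset, ord(ch))
        for offset, ch in enumerate(text)
        if ord(ch) < 0x20 and ord(ch) not in ALLOWED_C0
    ]


if __name__ == "__main__":
    sample = "field one\x04field two\n"
    for offset, cp in find_c0_controls(sample):
        print(f"offset {offset}: #x{cp:X}")
```

A non-empty result usually points at an upstream encoding or
data-cleaning problem rather than at content anyone meant to mark up.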

4. Issue 21: The cost of allowing null bytes in XML content is
very high and the benefits hard to understand.

5. I strongly feel that #x85 (NEXT LINE) should not be added to
the S production.  The reason is a simple cost-benefit analysis;
the share of computing installations where this is an issue
is small and shrinking.  Supporting this change imposes significant 
conversion costs on the rest of the world; the total global
net cost would be significantly less if the mainframe software
infrastructure took the necessary corrective measures to deal
with XML 1.0 as specified.

6. I strongly feel, even more so than in the case of #x85,
that #x2028 is inappropriate for inclusion in S.  Here are
some reasons:
 - If LINE SEPARATOR is to be included, why not the many 
   other Unicode characters with spacing semantics?  A
   coherent explanation needs to be provided on this
   point and I am unconvinced that one exists.
 - This would be the only core XML syntax character that
   can't fit in a byte.  This would complicate several
   automaton-driven parser construction strategies.  One
   of the key design goals of XML is to make programmers'
   lives simpler, so this objection should have weight.
 - "For completeness" is a really flimsy argument.
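
The byte-size point in items 5 and 6 can be checked with a short
sketch.  It compares the four XML 1.0 whitespace characters with the
two proposed additions, both by code point and by UTF-8 encoded
length.  The list and function names are hypothetical, for
illustration only.

```python
# XML 1.0 production S: space, tab, carriage return, line feed.
XML10_S = [0x20, 0x09, 0x0D, 0x0A]
# Additions proposed in the 1.1 draft: NEXT LINE and LINE SEPARATOR.
PROPOSED_S = [0x85, 0x2028]


def fits_in_byte(code_point):
    """True if the code point itself is representable in 8 bits."""
    return code_point <= 0xFF


def utf8_length(code_point):
    """Number of bytes the code point occupies when encoded as UTF-8."""
    return len(chr(code_point).encode("utf-8"))


if __name__ == "__main__":
    for cp in XML10_S + PROPOSED_S:
        print(f"#x{cp:X}: fits in a byte = {fits_in_byte(cp)}, "
              f"UTF-8 bytes = {utf8_length(cp)}")
```

Every existing S character is one byte in UTF-8.  #x85 still fits in
a byte as a code point but takes two bytes in UTF-8, and #x2028 takes
three, so a byte-at-a-time scanner would need multi-byte lookahead
just to recognize whitespace.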

7. In [4], #x37a is included, which is a combining 
   character and shouldn't be in NameStart.

8. In [4], #xf7 is included (division sign), but the
   rest of the mathematical operators (starting at 
   #x2200) are excluded.

9. The inclusion of a block #x202A-#x218F is kind
   of puzzling... it starts in the middle of one of the
   punctuation blocks, and the first few chars seem
   really unsuitable.  What's the intent... wanting to
   include the currency symbols?  This definitely
   needs some explanation.

10. There are some problems in the #x2800-#xD7FF block.
    Do we really want CJK radicals (#x2e80...), compatibility
    Jamo, ideographic description chars, and so on?

11. Should that block end at #xD7AF or #xD7FF?

12. [#xFDE0-#xFFEF] includes the private use area and lots
    of compatibility characters which XML 1.0 actually
    deprecates for use at all, let alone as names.  This
    is astounding and needs some defense.  If this is OK,
    why not throw in all the punctuation?

13. What's wrong with ASCII digits as name start chars, given
    that all sorts of other digits are going in?

14. There really needs to be some deep discussion in this 
    document of why this alternative was chosen.  When I 
    look at some of the wildly unlikely things that are 
    allowed to appear in names, the obvious question is:
    Why not rely on the Unicode properties database?  In
    particular, this allows lots of Name characters that 
    are not in fact Unicode characters at all and probably
    never will be.
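
As a rough sketch of what relying on the Unicode properties database
could look like, the following uses Python's standard unicodedata
module to tally General Category values over a code point range.  The
function name is hypothetical, and the results for most ranges depend
on the Unicode version bundled with the interpreter, so this is an
illustration rather than a proposed definition of Name characters.

```python
import unicodedata
from collections import Counter


def category_counts(first, last):
    """Tally Unicode General Category values for first..last inclusive.

    'Cn' means unassigned or noncharacter, 'Co' private use,
    'L*' letters, 'Nd' decimal digits, and so on.
    """
    return Counter(
        unicodedata.category(chr(cp)) for cp in range(first, last + 1)
    )


if __name__ == "__main__":
    # The first sixteen code points of the [#xFDE0-#xFFEF] range.
    print(category_counts(0xFDE0, 0xFDEF))
    # ASCII digits and other scripts' digits share a category.
    print(unicodedata.category("1"), unicodedata.category("\u0660"))
```

#xFDE0-#xFDEF are permanently reserved noncharacters, so they report
'Cn' regardless of Unicode version; a range-based Name production
admits them, while a property-based one would not.  The same lookup
shows ASCII "1" and ARABIC-INDIC DIGIT ZERO (#x660) are both 'Nd',
which is the asymmetry item 13 asks about.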

15. Issue 11:
    I can see both sides of this question.  My intuition is
    that the computational cost of doing this is unacceptably
    high for high-throughput applications of XML, but we need
    some research to establish if this is the case.  If it can
    be done cheaply and compactly, it's probably a good idea.

Follow-Ups:
- Re: [xml-dev] Some comments on the 1.1 draft
  - From: Richard Tobin <richard@cogsci.ed.ac.uk>
