OASIS Mailing List Archives

   Re: [xml-dev] Quick Review of XML 1.1 Candidate Recommendation

From: "Tim Bray" <tbray@textuality.com>

> My problem is that XML has de facto been a significant step forward for 
> interoperability between heterogeneous systems, and this seems like a 
> step backward.  At the moment, we can say confidently that XML markup 
> exposes logical structure unambiguously, and the content is text, which 
> means a sequence of unicode characters, and the characters have the 
> semantics that Unicode says they have.  This is fine for characters such 
> as 'a' or &#x222b; (the integral sign), but the range &#x0; - &#x1f; is 
> another kettle of fish.  By my reading, none of the characters in the 
> ranges 0-#x7, #xb, #xe-#x1a have any agreed-upon semantics de jure or de 
> facto (let's go down to the mall and do some &#x16;). 
 
Starting with about Unicode 3.0, the U+0080-U+009F code points are 
now occupied by the ISO C1 controls, unless specifically overridden;
neither XML 1.0 nor XML 1.1 specifically overrides them.

See http://www.unicode.org/unicode/uni2book/ch13.pdf  s.13.1

XML 1.1 is intended to cope with Unicode 3.n, and the new fixing of the 
C1 controls is one of those things.  So the backwards compatibility issue
is really one that springs from Unicode, not from XML, IMHO.  It was 
pretty suspect (or a convenient hack) to use the C1 code points before.

Tim's point about needing to follow the Unicode semantics is well-made and
important, but I think the XML 1.1 draft *does* do this. The semantics of
a text stream is that a control character appearing in it is a control character
that should be interpreted or stripped or used.  A control character that
is desired to be part of the data content (rightly or wrongly) should never
be sent directly: it is a mistake of XML 1.0 to allow direct C1 characters.
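
To make the difference concrete, here is a rough Python sketch of how the
two versions treat a given code point. The ranges are my transcription of the
Char and RestrictedChar productions from XML 1.0 and from the final XML 1.1
Recommendation; that is an assumption, since the Candidate Recommendation
under discussion may differ in detail. The names classify, in_ranges and the
three result labels are made up for illustration.

```python
# Sketch only: ranges transcribed from the Char / RestrictedChar productions
# of XML 1.0 and the final XML 1.1 Recommendation (assumption: the Candidate
# Recommendation may differ in detail).
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def in_ranges(cp, ranges):
    """True if code point cp falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def classify(cp, version="1.1"):
    """Return 'literal', 'ncr-only' or 'forbidden' for code point cp.

    'literal'   : may appear directly in the document
    'ncr-only'  : may appear only as a numeric character reference
    'forbidden' : may not appear at all, not even as a reference
    """
    if version == "1.0":
        # XML 1.0 has no middle category: a character is either a Char or not.
        return "literal" if in_ranges(cp, XML10_CHAR) else "forbidden"
    if not in_ranges(cp, XML11_CHAR):
        return "forbidden"
    return "ncr-only" if in_ranges(cp, XML11_RESTRICTED) else "literal"
```

Under these assumptions, classify(0x80, "1.0") gives 'literal' (the direct C1
character that XML 1.0 allows), classify(0x80, "1.1") gives 'ncr-only', and
U+0000 is 'forbidden' in both versions.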

Ultimately, it comes down to a model of layering.  I believe the layering
is 
   applications and data stores
   -------------------------------------------------------------------------------
   Infoset data (can include controls, but not null)
   -------------------------------------------------------------------------------
   XML, which must be compatible with "textual" text/*  MIME
   -------------------------------------------------------------------------------
   text data being sent as a data stream, by some system using controls
   -------------------------------------------------------------------------------
   packets         
   -------------------------------------------------------------------------------

That is more the kind of old telnet/modem-ish model that the RFCs
have underlying them, and XML 1.1 supports this better than XML 1.0
does.   

The second prong that Tim raises is that in XML "the content is text"
(i.e. not binary), by which he is suggesting that non-text data
should not be serialized as XML but first encoded using, say, Bin64 notation.
Unfortunately, this currently requires some kind of schema processing 
and some kind of PSVI to extract the string: a lot of overhead for a little
feature. And the WXS Bin64 has a problem that there is no standard way to
say what the data is after it is decoded: what is its notation or MIME type?
So Bin64 can only be used with private conventions anyway.

As Richard comments, arbitrary binary data still cannot be sent, because
the U+0000 character NULL is not available in numeric character references.
If we have no objection to Bin64 encoded data content, I don't see the
problem with control characters as NCRs: both are textual and
opaque. 
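
As an illustration of "both are textual and opaque", the following Python
sketch writes the same awkward data in both forms. The function name
escape_controls is hypothetical, and the choice of which controls to leave
literal (tab, LF, CR and NEL) is an assumption based on my reading of the
final XML 1.1 Recommendation rather than of the draft being reviewed.

```python
import base64


def escape_controls(text):
    """Serialize text for XML 1.1 content, writing controls as hex NCRs.

    Assumption: tab, LF, CR and NEL (U+0085) may stay literal; every other
    C0/C1 control and DEL goes out as a numeric character reference.
    U+0000 has no representation at all, so it is rejected.
    """
    markup = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            raise ValueError("U+0000 cannot be represented, even as an NCR")
        if ch in markup:
            out.append(markup[ch])
        elif (cp < 0x20 and ch not in "\t\n\r") or (0x7F <= cp <= 0x9F and cp != 0x85):
            out.append("&#x%X;" % cp)
        else:
            out.append(ch)
    return "".join(out)


# The same two-character payload, once as NCRs and once as base64 text.
as_ncr = escape_controls("a\x16")
as_b64 = base64.b64encode("a\x16".encode("utf-8")).decode("ascii")
```

Either way the receiver sees only ordinary text characters on the wire and
has to decode them to get the original values back. The difference is that
the base64 form can carry a zero byte, whereas escape_controls has to refuse
U+0000.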

> And furthermore, the reason why our friends at Microsoft & IBM et al 
> want this is so they can take filthy dirty data out of database fields 
> and wrap XML tags around it and claim interoperability, which is pretty 
> questionable. -Tim

As long as it is represented as text, why are the controls (when sanitized) any 
less filthy than the PUA characters?   I am all in favour of making XML more
comprehensive and more "textual" as a notation (in the terminology of the 
RFC for MIME types for text/*).  When this is still safe (no nulls), fits
into the Internet layers better, is more mainstream SGML-ish, *and* improves 
robustness no end (better encoding detection), it is a pretty credible package.

Cheers
Rick Jelliffe
