Re: Bad News on IE6 XML Support



From Elliotte Rusty Harold:
 
> >Microsoft already knows about this and made the choice for the reasons I 
> > described. Here's a malformed document:
> ><test>&#x0005;</test>

From: "Joshua Allen" <joshuaa@microsoft.com>

>This falls into the category of not breaking stuff that already works.

No, it falls into the category of maintaining stuff that is already broken.  

Of course MS have a good point: the general principle of
being conservative in what you send and generous in what you accept
has served well.  And a browser is a platform for showing information,
not hiding it (well, this is reasonable for an editor, but for a browser it is
not really convincing).

XML went the draconian way on purpose, because of the need for
interoperability.

Why is character 5 not allowed?  Because it has a specific meaning, ENQ,
described on a good site as:

"A transmission control character used as a request for a response from a remote station; the response may include station identification and/or station status. When a "Who are you" function is required on the general switched transmission network, the first use of ENQ after the connection is established shall have the meaning "Who are you" (station identification). Subsequent use of ENQ may, or may not, include the function "Who are you", as determined by agreement. "[1]

Any XML document which includes such characters has either been created in
ignorance, or has had a networking problem where data from some lower layer
has leaked into the text stream.

In the first case, anyone who includes control characters in a text/* document
doesn't know the fundamentals of their business and should go back to school.
Transmitting binary data is a legitimate need, but there are ways to encode
binary in XML, and there are more than a hundred thousand private-use code
points available for user-defined function characters if needed.
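
As a rough illustration of the point about encoding binary, here is a minimal Python sketch that assumes base64 as the encoding. The element name "data" and the "encoding" attribute are made up for the example and do not come from any spec.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(payload: bytes) -> str:
    """Return a small XML document carrying payload as base64 text."""
    # "data" and "encoding" are hypothetical names chosen for this sketch.
    elem = ET.Element("data", {"encoding": "base64"})
    elem.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def unwrap_binary(doc: str) -> bytes:
    """Recover the original bytes from a document made by wrap_binary."""
    elem = ET.fromstring(doc)
    return base64.b64decode(elem.text or "")


# The payload contains ENQ (0x05) and NUL, neither of which may appear in XML.
raw = b"\x05\x00\x11binary"
doc = wrap_binary(raw)
```

The control bytes travel as ordinary letters and digits, so the document stays well-formed and the receiver gets the exact bytes back with unwrap_binary(doc).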

In the second case, there has been an error and normal functionality of the system
has broken down: the document should not be accepted by the receiving system
as if it were normal. 

For a further reason why control characters should not be allowed, the
proposed IETF standard on MIME types for XML notes:

   "Some terminal devices have keys whose output, when pressed, can be
   changed by sending the display processor a character sequence.  If
   this is possible the display of a text object containing such
   character sequences could reprogram keys to perform some illicit or
   dangerous action when the key is subsequently pressed by the user.
   In some cases not only can keys be programmed, they can be triggered
   remotely, making it possible for a text display operation to directly
   perform some unwanted action.  As such, the ability to program keys
   SHOULD be blocked either by filtering or by disabling the ability to
   program keys entirely."[2]


> The whole "blueberry" debate should have made it apparent that there are
> many, many real people who produce XML that has such characters in it.

No, they are not producing XML. They are producing something else.

> We fixed the XML parser to correctly fail on such characters, but
> unfortunately the IE team does not agree with your assessment that it
> would be a good thing for customers to have us break their existing
> apps.  

It is not up to the IE team to decide what is in XML: either use it
or make up your own spec. 

> And don't even try to claim that the existence of XML documents
> with such characters is Microsoft's fault.  I have seen these coming
> from VMS and Unix systems just as much as anywhere else.  The IE team
> chose not to penalize users for having such documents.

No, they chose to penalize everyone else's XML systems. If that IE
parser accepts something and Oracle's correctly finds the error,
are the punters supposed to know that MS messed up and Oracle is correct?
(Or any other example, such as Sun's parser etc.)  Why isn't this just
"embrace and extend", which the rest of us are sick of?

> OK, so I read through the thread, and I can see where you are coming
> from.  I would argue that your first example (ASCII 0x05) is an example
> of IE *not* crippling XML as much as the spec demands, and I do not
> agree with the level of emergency you place on this particular issue.

So it is not a problem if control characters are sent in files?  And
there is something called "XML" which exists outside what the spec
"demands"?  The spec "specifies": we can certainly argue that, being
part of a social process, any spec is incomplete, but the MS guy seems
to be saying the spec can be wrong on what XML is... yikes, seems a
little pathetic.

What about ^Q, ^S (serial flow control), ^D, ^Z (UNIX and DOS
end-of-file), NUL (C string terminator), DEL?  If we were debating the
specific merits of control characters, it would be a different matter (one
could say that some control characters are obsolete).

The text/* MIME type is quite clear about control characters:

" .. any use of the control characters or DEL in a body must
   either occur

    (1)   because a subtype of text other than "plain"
          specifically assigns some additional meaning, or

    (2)   within the context of a private agreement between the
          sender and recipient. Such private agreements are
          discouraged and should be replaced by the other
          capabilities of this document."[3]

which delegates it to XML: 

"Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646."

"Character Range
[2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */"

"It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding. It is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16."[4]
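
The Char production quoted above is small enough to write down directly. A minimal Python sketch follows; the function names is_xml_char and first_illegal are mine, invented for illustration, and the ranges are copied from the production.

```python
from typing import Optional, Tuple


def is_xml_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD      # skips the surrogates, FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def first_illegal(text: str) -> Optional[Tuple[int, int]]:
    """Return (index, code point) of the first non-Char character, or None."""
    for i, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return i, ord(ch)
    return None
```

A producer could run something like first_illegal over its character data before serializing, so that a stray ENQ is caught at the sending end rather than at the parser.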

When an XML processor discovers a character 5, it should either fail because
it decides the character is the control character 5 and so not "legal", or
fail because the character indicates some encoding error.
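
For what a conforming processor does with the malformed document from earlier in this thread, here is a small Python sketch. It assumes the standard library's ElementTree, which in CPython is backed by expat; the helper name parses is made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """True if the parser accepts doc, False if it reports a fatal error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


malformed_ref = "<test>&#x0005;</test>"   # character reference to ENQ
malformed_raw = "<test>\x05</test>"       # literal ENQ in the content
wellformed = "<test>&#x9;</test>"         # tab is a legal Char

for name, doc in [("reference", malformed_ref),
                  ("literal", malformed_raw),
                  ("tab", wellformed)]:
    print(name, "accepted" if parses(doc) else "rejected")
```

With an expat-based build I would expect both ENQ documents to be rejected and the tab document to be accepted, which is the behaviour the spec asks for.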

Personally, I am looking forward to the release of MSXML 4.0.  It looks like
it will have excellent performance, and be well thought out.  I can understand
the need to support obsolete versions of draft specs (XSL, Schemas) so that
they don't abandon users (a touching dedication to supporting existing users
which will be a great source of comfort to all Java users and plug-in users, I am sure)
but XML is simply not like that.

It was designed to be draconian for maximum interoperability. Interoperability 
is not served by text documents with embedded control characters. 

Cheers
Rick Jelliffe

[1] http://www.malibutelecom.fi/yucca/chars/c0.html
[2] ftp://ftp.isi.edu/in-notes/rfc3023.txt
[3] ftp://ftp.isi.edu/in-notes/rfc2046.txt  page 8
[4] http://www.w3.org/TR/REC-xml