xml-dev - RE: [xml-dev] MSXML DOM Special Chars Less Than 32


To: "Mike Brown" <mike@skew.org>,<xml-dev@lists.xml.org>
Subject: RE: [xml-dev] MSXML DOM Special Chars Less Than 32
From: "Michael Rys" <mrys@microsoft.com>
Date: Wed, 27 Mar 2002 20:10:25 -0800
Thread-index: AcHVSn+D0JGIqTFhQAOvJsOsqUKDtwAw6kYg
Thread-topic: [xml-dev] MSXML DOM Special Chars Less Than 32

With all due respect, if you had read the mail that prompted this
question, you would have seen that the proposal was to change the role
of character entitization.

We are trying to solve a problem, and that may mean getting rid of old
rules while staying backwards-compatible as far as possible and with as
much agreement as we can get (i.e., getting a rule into XML 1.1).

Best regards
Michael

> -----Original Message-----
> From: Mike Brown [mailto:mike@skew.org]
> Sent: Tuesday, March 26, 2002 20:44 PM
> To: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] MSXML DOM Special Chars Less Than 32
> 
> James Clark wrote:
> > > How would you serialize a C# string that contains the sequence
> > > 0xD800,0xD800?  If you serialize it as &#xD800;&#xD800;, then what
> > > happens if somebody writes &#xD800;&#xDC00;? Is that equivalent to
> > > &#x10000;?
> 
> Michael Rys wrote:
> > This is basically a question of the encoding. If you use UTF-16 then
> > it's the parser's job to take &#xD800;&#xDC00; and map it into the
> > encoding of the target environment. If you use UCS-4 as the encoding,
> > then you probably did not generate &#xD800;&#xDC00; in the first place
> > but &#x10000;...
> 
> With all due respect, this is nonsense.
> 
> 1. A character reference is a lexical construct for representing a single
> Unicode character by its decimal or hexadecimal code point. It is not a
> generic mechanism for representing any code point, and it is not a
> mechanism for representing characters by their code values in some
> encoding form.
> 
> A character reference must only reference a code point that corresponds to
> a Unicode character, and that character must be legal in XML. Character
> references have nothing to do with encoding, and I hope this discussion is
> not proposing that XML change in this regard.
> 
> 2. Code points 0xD800-0xDFFF are *not* mapped to characters in Unicode /
> ISO/IEC 10646. That's why they're excluded from XML's char production.
> There is no character number 0xD800, and there never will be.
> 
> 3. The only way to use character references to represent character #
> 0x10000 is to write &#x10000; or &#65536;. "&#xD800;&#xDC00;" is not
> well-formed XML (see WFC: Legal Character in sec. 4.1 of the spec).
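Points 1 to 3 above can be sketched in Python. This is only an illustration under stated assumptions, not the code of any real parser: the ranges in `is_xml_char` follow the Char production of XML 1.0, while the names `is_xml_char` and `resolve_char_ref` are hypothetical helpers made up for this example.

```python
import re

# Matches a decimal (&#65536;) or hexadecimal (&#x10000;) character reference.
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def is_xml_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates 0xD800-0xDFFF, and 0xFFFE/0xFFFF
        or 0x10000 <= cp <= 0x10FFFF
    )


def resolve_char_ref(ref: str) -> str:
    """Resolve one character reference to a character, or raise ValueError.

    The number in the reference is read as a code point, never as a code
    value in some encoding form, so surrogate pairs are not recombined.
    """
    match = _CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError("not a character reference: %r" % ref)
    hex_digits, dec_digits = match.groups()
    cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits, 10)
    if not is_xml_char(cp):
        # Corresponds to violating WFC: Legal Character.
        raise ValueError("U+%04X is not a legal XML character" % cp)
    return chr(cp)
```

With this sketch, `resolve_char_ref("&#x10000;")` and `resolve_char_ref("&#65536;")` both yield the single character U+10000, while `resolve_char_ref("&#xD800;")` raises `ValueError`.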
> 
> The answer to James' first question is that if the C# "string" is actually
> a sequence of 16-bit values, and there are no guarantees that these values
> are going to conform to the rules of UTF-16 or some other predictable
> encoding, then it is wrong to blindly serialize that data with a mechanism
> that writes out the hex values preceded by "&#x" and followed by ";".
> 
> Just as one would do with the 0x0000-0x0008 and other control characters,
> you look at it before you serialize it, and say "can I put this in XML or
> not", and if there's no way you can make a legal character out of it, then
> you ignore it, or you raise an exception, or (though I don't agree with
> this), you use "?". You don't emit malformed XML.
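The "look at it before you serialize it" approach might be sketched as follows in Python. Everything here is an assumption made for illustration: the input is modelled as a list of 16-bit integers, and the names `serialize_code_units`, `escape_code_point`, and the `on_error` parameter are hypothetical. The three policies mirror the options listed above (ignore, raise an exception, or substitute "?").

```python
def is_xml_char(cp):
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def escape_code_point(cp):
    """Write one legal code point as ASCII-only XML character data."""
    if cp == 0x26:
        return "&amp;"
    if cp == 0x3C:
        return "&lt;"
    if cp == 0x3E:
        return "&gt;"
    if 0x20 <= cp < 0x7F or cp in (0x9, 0xA):
        return chr(cp)
    return "&#x%X;" % cp  # one reference per character, never per code unit


def serialize_code_units(units, on_error="raise"):
    """Serialize a sequence of 16-bit values, refusing to emit malformed XML."""
    if on_error not in ("raise", "ignore", "replace"):
        raise ValueError("unknown on_error policy: %r" % on_error)
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Well-formed surrogate pair: recover the real code point first.
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            i += 2
        else:
            cp = u  # may be a lone surrogate or a control code; checked below
            i += 1
        if is_xml_char(cp):
            out.append(escape_code_point(cp))
        elif on_error == "raise":
            raise ValueError("cannot represent 0x%04X in XML" % cp)
        elif on_error == "replace":
            out.append("?")
        # on_error == "ignore": drop the value silently
    return "".join(out)
```

Under these assumptions, `serialize_code_units([0xD800, 0xDC00])` produces `&#x10000;`, whereas `serialize_code_units([0xD800, 0xD800])` raises `ValueError` instead of writing `&#xD800;&#xD800;`.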
> 
> 
> I understand that Michael Kay is proposing that the well-formedness
> constraint be modified such that "&#x0;" is legal but the bytewise encoded
> NUL is not, and perhaps the discussion above is based on "what if" that
> kind of thing were allowed. I have issues with his proposal as well, but at
> least the 0x0-0x1F code points do map to actual characters, whereas
> "&#xD800;" and "&#xDC00;" and (as another example) "&#xFFFE;" do not.
> 
> FWIW, I am strongly against changing the definition of a character
> reference in order to make "&#xD800;" allowable. Let's not proceed with
> theoretical arguments that assume character references are more complex or
> flexible than they actually are.
> 
>    - Mike
>
> __________________________________________________________________________
>   mike j. brown                   |  xml/xslt: http://skew.org/xml/
>   denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/
> 

Follow-Ups:
- Re: [xml-dev] MSXML DOM Special Chars Less Than 32
  - From: Mike Brown <mike@skew.org>
