OASIS Mailing List Archives

   Re: [xml-dev] MSXML DOM Special Chars Less Than 32

James Clark wrote:
> > How would you serialize a C# string that contains the sequence
> > 0xD800,0xD800?  If you serialize it as &#xD800;&#xD800;, then what
> > happens if somebody writes &#xD800;&#xDC00;? Is that equivalent to
> > &#x10000;?

Michael Rys wrote:
> This is basically a question of the encoding. If you use UTF-16 then
> it's the parser's job to take &#xD800;&#xDC00; and map it into the
> encoding of the target environment. If you use UCS-4 as the encoding,
> then you probably did not generate &#xD800;&#xDC00; in the first place
> but &#x10000;...

With all due respect, this is nonsense.

1. A character reference is a lexical construct for representing a single
Unicode character by its decimal or hexadecimal code point. It is not a
generic mechanism for representing any code point, and it is not a mechanism
for representing characters by their code values in some encoding form.  

A character reference must only reference a code point that corresponds to a
Unicode character, and that character must be legal in XML. Character
references have nothing to do with encoding, and I hope this discussion is not
proposing that XML change in this regard.

2. Code points 0xD800-0xDFFF are *not* mapped to characters in Unicode /
ISO/IEC 10646. That's why they're excluded from XML's char production. There
is no character number 0xD800, and there never will be.

3. The only way to use character references to represent character # 0x10000
is to write &#x10000; or &#65536;. "&#xD800;&#xDC00;" is not well-formed XML
(see WFC: Legal Character in sec. 4.1 of the spec).
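
To make points 1-3 concrete, here is a minimal sketch of the test a serializer can apply before writing a character reference. It is in Python rather than C#, and the names is_xml_char and char_ref are hypothetical, chosen only for illustration. The ranges are those of the Char production in XML 1.0.

```python
def is_xml_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF       # stops just below the surrogate block
        or 0xE000 <= cp <= 0xFFFD     # resumes just above it; excludes FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF
    )


def char_ref(cp: int) -> str:
    """Return a hexadecimal character reference for cp, or refuse."""
    if not is_xml_char(cp):
        raise ValueError(f"U+{cp:04X} is not a legal XML character")
    return f"&#x{cp:X};"
```

Under these assumptions, char_ref(0x10000) yields a reference to the single character U+10000, while char_ref(0xD800) raises instead of producing a reference that no conforming parser would accept.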

The answer to James' first question is that if the C# "string" is actually a
sequence of 16-bit values, and there are no guarantees that these values are
going to conform to the rules of UTF-16 or some other predictable encoding,
then it is wrong to blindly serialize that data with a mechanism that writes
out the hex values preceded by "&#x" and followed by ";". 

Just as one would with 0x0000-0x0008 and the other control characters, you
look at the data before you serialize it and ask "can I put this in XML or
not?" If there's no way you can make a legal character out of it, then you
ignore it, or you raise an exception, or (though I don't agree with this) you
use "?". You don't emit malformed XML.
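
The look-before-you-serialize approach can be sketched as follows. This is an illustration under stated assumptions, not anyone's actual serializer: the input is modeled as a Python list of 16-bit integers standing in for the C# string, and the function name escape_utf16_units and the on_error parameter are hypothetical. Its three settings correspond to the three options above.

```python
def is_xml_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.0."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def escape_utf16_units(units, on_error="raise"):
    """Turn 16-bit code units into XML text, never emitting illegal references.

    on_error is "raise", "ignore", or "replace" (substitute "?").
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # A well-formed surrogate pair encodes ONE character above U+FFFF.
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            i += 2
        else:
            cp = u  # may be a lone surrogate or a control code; checked below
            i += 1
        if is_xml_char(cp):
            if 0x20 <= cp <= 0x7E and chr(cp) not in "<>&":
                out.append(chr(cp))        # plain printable ASCII
            else:
                out.append(f"&#x{cp:X};")  # one reference per character
        elif on_error == "raise":
            raise ValueError(f"code unit 0x{cp:04X} cannot be put in XML")
        elif on_error == "replace":
            out.append("?")
        # on_error == "ignore": drop the offending unit silently
    return "".join(out)
```

With this sketch, the pair 0xD800,0xDC00 comes out as a single reference to U+10000, and the sequence 0xD800,0xD800 is rejected, dropped, or replaced according to on_error. It is never written out as two references to surrogate code points.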

I understand that Michael Kay is proposing that the well-formedness constraint
be modified such that "&#0;" is legal but the bytewise encoded NUL is not,
and perhaps the discussion above is based on "what if" that kind of thing were
allowed. I have issues with his proposal as well, but at least the 0x0-0x1F
code points do map to actual characters, whereas "&#xD800;" and "&#xDC00;" and
(as another example)  "" do not.

FWIW, I am strongly against changing the definition of a character reference
in order to make "&#xD800;" allowable. Let's not proceed with theoretical
arguments that assume character references are more complex or flexible than
they actually are.

   - Mike
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/



Copyright 2001 XML.org. This site is hosted by OASIS