Forwarded for Rick Jelliffe:
From: Rick Jelliffe [mailto:firstname.lastname@example.org]
Derek Denny-Brown wrote:
> XML 1.0 had surrogate pairs. The Unicode 2.0 code-point space was
But Unicode 2.0 and Unicode 3.0 did not define any characters
in those code points. So the presence of any surrogate characters in
names was an error and did not need to be checked by any special mechanism.
For data, there are no standard mappings from user-defined
character points in East Asian regional encodings to the Unicode non-BMP PUA,
so any use of those in data would be unreliable except by luck within the
same platform (I believe MS has its own mappings to the BMP PUA but I
am not aware that this involves non-BMP and therefore, potentially,
XML 1.0 only mentions surrogates to say that they are not characters
(i.e. that character in XML is Unicode Character not UTF-16 code)
and does not require that surrogate pairs be checked.
Note that XML 1.0 says "It is a fatal error if an XML entity is determined
(via default, encoding declaration, or higher-level protocol) to be in a
certain encoding but contains byte sequences that are not legal in that
encoding." but, for UTF-16, that just requires going from bytes to
surrogates, not from surrogates to non-BMP or checking that entities
don't start with combining characters.
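
To make the "Unicode character, not UTF-16 code" distinction concrete, here is a minimal Java sketch that tests code points against the Char production of XML 1.0 (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The class and method names are made up for this illustration and are not part of any parser API; a real parser may organise this check quite differently.

```java
final class XmlCharCheck {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)       // BMP below the surrogate block
            || (cp >= 0xE000 && cp <= 0xFFFD)     // BMP above the surrogate block
            || (cp >= 0x10000 && cp <= 0x10FFFF); // supplementary planes
    }

    /** True if every code point in s is an XML 1.0 Char. */
    static boolean allXmlChars(String s) {
        // codePoints() combines well-formed surrogate pairs into one code point
        // and passes an unpaired surrogate through as its own value.
        return s.codePoints().allMatch(XmlCharCheck::isXmlChar);
    }

    public static void main(String[] args) {
        // A well-formed pair that encodes the single code point U+20000.
        System.out.println(allXmlChars("\uD840\uDC00"));
        // A lone high surrogate between two ASCII letters.
        System.out.println(allXmlChars("a\uD840b"));
    }
}
```

The range D800 to DFFF simply does not appear in Char, so a stray surrogate is rejected without any pairing logic in the production itself.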
> I over simplified my description of what characters are allowed in
> names, agreed. I have been told by some of our customer reps that the
> allowed character range for names has been a blocking issue for some
> Asian customer. It may be that it is only one key character which is
> causing the problem, or it may be an entire class of characters. I only
> know that it is blocking customer adoption.
The only thing I can think of is the Unicode presentation forms,
from U+F900 to U+FFEE. This notably includes fullwidth Latin
and halfwidth katakana. However, note that the JIS technical note says:
"Those digits, Latin characters, and special characters of [JIS X 0208]
which are also specified by [JIS X 0201] are deprecated. Likewise,
Halfwidth Katakana of [JIS X 0201] are deprecated."
<geek>There is debate over whether the half and fullwidth forms really
represent different characters (in the context of data that can be marked
up), and if they are different whether they cause problems and should be
deprecated anyway, or whether the analogy is with upper and lower case, where
it is the user's problem to figure out when one is used and not the other.
At the time, there was a comment that because different UIMs defaulted to
full- or half-width forms differently, there would be more incompatibility
caused by allowing them into names than not. Unicode 2.0 is clear that
the presentation forms are provided to allow round-tripping of texts, even
though in Unicode terms they are the same characters.
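
One way to see the "same characters" relationship in practice is Unicode compatibility normalization. The sketch below is only an illustration, using java.text.Normalizer from the Java standard library; the class name WidthFolding and the method fold are invented for the example, and nothing here is required by XML.

```java
import java.text.Normalizer;

final class WidthFolding {
    /** Folds compatibility characters, including width variants, using NFKC. */
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String fullwidthA = "\uFF21";  // FULLWIDTH LATIN CAPITAL LETTER A
        String halfwidthKa = "\uFF76"; // HALFWIDTH KATAKANA LETTER KA

        // Compatibility normalization maps each width variant to its ordinary form.
        System.out.println(fold(fullwidthA).equals("A"));
        System.out.println(fold(halfwidthKa).equals("\u30AB"));

        // Canonical normalization (NFC) keeps the width variant distinct,
        // which is what allows round-tripping with the legacy encodings.
        String nfc = Normalizer.normalize(fullwidthA, Normalizer.Form.NFC);
        System.out.println(nfc.equals(fullwidthA));
    }
}
```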
Note that the latest version of W3C's Character Model
says "Specifications SHOULD exclude compatibility characters in the
syntactic elements (markup, delimiters, identifiers) of the formats they
Other useful specs include http://www.w3.org/TR/unicode-xml/
which says how to treat characters in data (it says "in markup",
but it really means "in marked-up data content" rather than,
for example, "in tags".)
XML 1.1 removes checking those, adding them to the characters that
should be avoided in names (because they have canonical decompositions).
All that being said, it is difficult to see how this issue could "block"
the use of XML by anyone however. And, as can be seen, it is part of
a larger issue that even JIS has been involved in.
John Cowan wrote
>Rick Jelliffe wrote:
>> As another matter, Derek mentions spurious whitespace nodes. But if
>> using a DTD (and validating parser) these nodes will not
>> be generated.
> They'll still be generated, because conforming XML parsers have to return
> all character content. But they will be marked as element content
> For example, the SAX callback "ignorableWhitespace" (which means ignorable
> by applications that wish to ignore it, not ignorable by parsers) is the
> SAX way to signal that the returned character content is element content
I mean nodes in the DOM: use setIgnoringElementContentWhitespace(true)
if this is an issue. It clouds the issue to say this is an XML problem
when it is the responsibility of the API (and hence the programmer's
control) to decide.
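
For readers who have not used that option, here is a small sketch with the JAXP DocumentBuilderFactory. The sample document, the class name and the helper method are invented for the example. It assumes a validating parser and a DTD that declares element content for the root, since that is the only way the parser can know which whitespace is element content whitespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ElementContentWhitespace {
    // The internal DTD subset gives <root> element content only.
    static final String XML =
          "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (item, item)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "]>\n"
        + "<root>\n  <item>a</item>\n  <item>b</item>\n</root>";

    /** Parses XML and returns how many child nodes the root element has. */
    static int countRootChildren(boolean ignoreElementContentWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // needed so the DTD content model is used
            factory.setIgnoringElementContentWhitespace(ignoreElementContentWhitespace);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Option off: text, item, text, item, text.
        System.out.println(countRootChildren(false));
        // Option on: only the two item elements remain.
        System.out.println(countRootChildren(true));
    }
}
```

The same input yields either tree, which is the point: the choice sits with the API configuration, not with the XML.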
I know that people are talking about "XML" including APIs, architectures
and products, as a milieu, not just the wording of the XML REC. But it is
a fallacy to list problems of "XML" considered broadly then switch to
"XML" considered as a limited spec: "I had a problem with API XXX; API XXX
uses XML; therefore XML is bad". If the problem is that there are too many
SAX and DOM properties, features and options, then that is not XML's
problem: maybe APIs should just have a big switch
to keep things simpler for programmers:
DATA to strip out all insignificant whitespace and non-leaf text nodes
containing only whitespace.
DOCUMENTS to strip out insignificant whitespace
ROUND-TRIP to preserve everything
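
As a rough sketch of what such a switch might look like over an already-built DOM tree: the enum, the class and the apply method below are hypothetical and belong to no existing API. The sketch assumes "insignificant" means whatever the parser flagged as element content whitespace (DOM Level 3 Text.isElementContentWhitespace(), which is only informative when a DTD was used), and it reads "non-leaf" as "inside an element that also has element children".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class WhitespaceSwitch {
    enum Mode { DATA, DOCUMENTS, ROUND_TRIP }

    /** Prunes whitespace text nodes under node according to the chosen mode. */
    static void apply(Node node, Mode mode) {
        if (mode == Mode.ROUND_TRIP) return; // preserve everything

        boolean hasElementChild = false;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) hasElementChild = true;
        }

        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // grab before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE) {
                Text text = (Text) child;
                boolean declared = text.isElementContentWhitespace();
                boolean blankNonLeaf = hasElementChild && text.getData().trim().isEmpty();
                // DOCUMENTS drops only what the DTD marks as insignificant;
                // DATA also drops blank text that sits between elements.
                if (declared || (mode == Mode.DATA && blankNonLeaf)) {
                    node.removeChild(child);
                }
            } else {
                apply(child, mode);
            }
            child = next;
        }
    }

    /** Helper for the example: parses a string without validation. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root>\n <a>x</a>\n <b> </b>\n</root>");
        apply(doc, Mode.DATA);
        // The blank text between <a> and <b> is gone; the space inside <b> is a leaf and stays.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```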