First I should say - just in case there's any doubt about it - that
though I am a member of the XML Core WG, this is just my personal
opinion.
I don't really care about any of this stuff. I have little use for
anything but ASCII in XML names. Control characters rarely cause me
trouble. I don't use IBM mainframes. I have no wish to ever
normalize Unicode text. But the Core WG has been asked to consider
all these issues, which for most of us is just another chore. There's
only a very small group of people who find character set issues
interesting, and I'm not one of them.
In article <004301c1847b$8f696f20$01b7c0d8@AlletteSystems.com> Rick writes:
>By allowing any character in names, it means that we can have WF XML 1.1
>documents which merely opening in a text editor (even an editor for the
>document encoding) will corrupt with a well-formedness error: if people use
>characters in names which may be split at by automated line-wrapping. A
>markup language which safe practise is to *never* open an entity in a text
>editor? Excellent advance!
Evidently people don't want to be stuck with Unicode 2.1 for XML
Names. Now XML could either move to some newer version of Unicode, or
have some automatic mapping from Unicode character classes to name
characters, or just allow (almost) everything and say that it's not
the business of XML to deal with this sort of detail. The first two
both require parsers to change as Unicode changes; the third is a
once-and-for-all change - I think that's the main reason why it is
what has been proposed.
The issue of line-wrapping is not one I was aware of, and one purpose
of publishing Working Drafts is precisely to get feedback from experts
like Rick on such things. On the other hand, I would never use an
editor that line-wrapped *any* of my files without an explicit command.
Do you really use such things on XML files? (I did once inadvertently
do fill-region on a Python program, which took an hour to fix.)
>I would guess that putting in Issue 18 and Issue 21 (should control
>be allowed? should 0x00 be allowed?) are just sacrificial lambs, put in to
>be removed later but not serious suggestions.
Not really. They were both seriously proposed. I think the case of
nul is certain to be rejected; it was left in for completeness. But
there are many applications where it would be useful to include
control characters - you have only to look back at the archives of
this mailing list and comp.text.xml to see people asking for advice on
>A markup language which was unsafe to store in files
I don't know what you mean by that.
>or to transmit on serial lines
Really? I have often transmitted 8-bit data over serial lines.
>or as text/*?
If people want to include control characters in their data, they will
have that problem regardless of whether they mark it up in XML.
One possibility I have heard of is that control characters could be
allowed, but only as character references.
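A minimal Python sketch of what that might look like on the writing
side. The function name escape_controls is hypothetical, and the
choice to leave tab, LF and CR literal and to reject nul outright is
my assumption, not something any draft says:

```python
def escape_controls(text: str) -> str:
    """Write C0 control characters as character references.

    Assumptions (not from any draft): tab, LF and CR stay literal,
    and nul is rejected rather than escaped.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            raise ValueError("nul cannot be represented")
        if cp < 0x20 and ch not in "\t\n\r":
            out.append("&#x%X;" % cp)  # e.g. U+0001 becomes &#x1;
        else:
            out.append(ch)
    return "".join(out)
```

A literal control character in the file would then still be an error;
only the reference form would be accepted.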
Your comment about "sacrificial lambs" seems to suggest that someone
is trying to push this through against public opposition. As far as I
am concerned, we are only doing this because people want it. In fact,
yours is the only such strong opposition I have heard.
>It would be interesting to speculate what principle causes characters to be
>considered whitespace: certainly it is not that all visible space should be
>whitespace (one sensisble rule) or that only ASCII should be space.
>Why is not just mapping NEL to #A on input enough to satisfy the IBM
Anything mapped to #A on input should also be a whitespace character,
so that it behaves consistently if it appears as a character reference
in an internal entity. U+2028 has been added because it has Unicode
backing as a universal line separator.
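To make the consistency point concrete, here is a rough Python
sketch. It assumes input handling that maps CR LF, CR NEL, lone CR,
NEL and LS to #A; that rule set and the names normalize_line_ends,
WHITESPACE and is_whitespace are illustrative, not quoted from any
draft:

```python
import re

# Assumed set of input line ends; longer sequences are listed first
# so that CR LF and CR NEL each collapse to a single #A.
_LINE_END = re.compile("\r\n|\r\x85|\r|\x85|\u2028")

# Assumed whitespace set: the usual four plus NEL and LS.
WHITESPACE = {" ", "\t", "\n", "\r", "\x85", "\u2028"}


def normalize_line_ends(text: str) -> str:
    """Map every assumed line-end sequence to a single #A."""
    return _LINE_END.sub("\n", text)


def is_whitespace(ch: str) -> bool:
    """True if ch is in the assumed whitespace set."""
    return ch in WHITESPACE
```

A NEL that arrives through a character reference is expanded after
line-end handling, so it is never mapped to #A. If it were not also in
the whitespace set, the literal and the referenced forms of the same
character would behave differently.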
>This gives us a markup language in which all markup a WF document could look
>by inspection as if every character is ASCII but could not be serialized out
>to ASCII. because of NELs or LS characters.
Unicode is full of characters that look like ASCII but aren't. Most
of the Greek capitals for example.
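A quick Python illustration (is_ascii_only is just a helper name I
made up for this):

```python
import unicodedata

latin_a = "A"            # U+0041
greek_alpha = "\u0391"   # U+0391, usually drawn just like U+0041


def is_ascii_only(text: str) -> bool:
    """True if text can be serialized as ASCII without references."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(unicodedata.name(latin_a))      # LATIN CAPITAL LETTER A
print(unicodedata.name(greek_alpha))  # GREEK CAPITAL LETTER ALPHA
print(is_ascii_only("<A/>"), is_ascii_only("<\u0391/>"))  # True False
```

So a document can already look like pure ASCII by inspection and fail
to serialize as ASCII, with no NEL or LS involved.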
>Another great joke is to "simplify" the naming rules to free a parser from
>having to worry about future upgrades to Unicode, but then requiring
>Normalized data (and suggesting it should be an error): surely this just
>ties the parser to having to know a particular version of Unicode to know
>which normalization rules to use!
I think most parser writers don't want to have to check for
normalization; it is the i18n people who are pressing for this. I
agree that this seems to either argue against normalization or weaken
the case for allowing all characters in names. Or is it intended that
future Unicode additions won't change the normalization rules?
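The dependency is easy to see in any library that does the check. A
small Python sketch, using NFC purely as an example form (which form
would be required is not something I am asserting here); is_nfc is a
hypothetical helper:

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if text is unchanged by NFC normalization.

    The answer comes from the Unicode data tables built into this
    Python, so it is tied to one particular Unicode version.
    """
    return unicodedata.normalize("NFC", text) == text


# The Unicode version whose tables this check relies on.
print(unicodedata.unidata_version)

print(is_nfc("\u00e9"))   # precomposed e-acute: True
print(is_nfc("e\u0301"))  # e followed by combining acute: False
```

A parser that had to report unnormalized data would carry the same
kind of versioned table.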