




   Re: [xml-dev] Supporting Unicode (was Some comments on the 1.1 draft)


On Thu, Dec 20, 2001 at 04:59:55PM +1100, Rob Griffin wrote:
> Wasn't one of the design goals of XML to be human readable?
> How do I do that? I display the document on my screen,
> or I print it out. Surely having the least number of control 
> characters in the document makes that more readily achievable.
> I don't want to have to use a hex editor to see the 'real' 
> contents of a document. Nor have my printer go ballistic
> or print blocks in place of control characters.

One of the arguments put forward concerned higher-level
applications that have string fields somewhere in them,
but must be able to store the *occasional* control character
(for good or bad reasons, or simply because something was wrong
originally). For example, I have an application where a human
can type in a string from a keyboard. Somehow they accidentally
typed in a ^T. The application they were using did not detect
it as an error. So the string contains an ugly ^T.

It would appear logical to encode the application string as
PCDATA. It makes the document more readable, and almost all of
the data is normal text. However, because the original application
did not enforce exactly the same constraints as XML on what
characters are legal, it's not safe to put the string in an XML
document without always base64 encoding it. This potentially means
any automatically constructed XML document should use base64
encoding for text. If it does not, then it may fail to capture the
original content or abort with an error. Some people think this is
good. I think it is bad. (Note that if the bogus character had been
a tab, it would have been accepted happily, so XML is only providing
partial protection anyway.)
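To make the dilemma concrete, here is a minimal sketch of what an "always safe" encoder has to do under the XML 1.0 character rules: check every string against the legal-character set, and fall back to base64 whenever a stray control character (like that ^T) turns up. The function names are illustrative, not from any standard library.

```python
import base64
import re

# Characters NOT legal in XML 1.0 content: everything below 0x20 except
# tab, LF, and CR, plus surrogates and 0xFFFE/0xFFFF. A sketch of the
# Char production, not an exhaustive validator.
_XML10_ILLEGAL = re.compile(
    "[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)

def safe_for_xml10(text: str) -> bool:
    """True if every character may appear literally in XML 1.0 content."""
    return _XML10_ILLEGAL.search(text) is None

def encode_field(text: str) -> tuple[str, str]:
    """Return (encoding, payload): plain text when safe, base64 otherwise."""
    if safe_for_xml10(text):
        return ("text", text)
    return ("base64", base64.b64encode(text.encode("utf-8")).decode("ascii"))
```

The awkward consequence described above falls out immediately: a generic producer cannot know in advance which fields are clean, so either every field pays the base64 cost, or the schema needs an extra attribute recording which encoding was used.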

Why bad? The main argument that I have heard (and it has real
merit, by the way) is that other systems might not be able to
handle the control characters, or might do something weird with them.

To me, excluding selected values from an XML document is XML doing
*exactly* the same thing to everyone else who wants to use XML.
It's putting limits on them, not because *XML* cannot handle it,
but because XML *chooses* to limit them. The result is that a core
and fundamental standard imposes limits on all the layers that
will be put on top of it.

So any data that did not originally come from XML almost certainly
needs to be base64 encoded if it's to be put into an XML document.
It could be a book title, part number, phone number, etc.
If this did not occur, then strange, unexpected, and cryptic errors
would be reported by subsystems that tried to use, for example,
SOAP to send a phone number.
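The kind of cryptic failure meant here is easy to reproduce with any XML 1.0 parser: a producer can serialize a phone-number field containing a stray ^T without complaint, but the receiver's parser then rejects the entire document with a low-level lexical error that says nothing about phone numbers. A sketch using Python's stdlib (expat-backed) parser:

```python
import xml.etree.ElementTree as ET

# A document whose only flaw is one control character (0x14, i.e. ^T)
# buried in a data field. The producer side raises no error at all.
payload = "<order><phone>555-0100\x14</phone></order>"

try:
    ET.fromstring(payload)
except ET.ParseError as err:
    # The receiver sees a generic well-formedness error,
    # e.g. "not well-formed (invalid token)" -- nothing that points
    # back to the offending field.
    print("receiver error:", err)
```

The error surfaces far from its cause, in a different process, phrased in the parser's vocabulary rather than the application's.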

I do agree that putting control characters directly into an XML
document is bad form and is likely to break other tools. OK, it's
bad practice. Don't do it (use &#n; instead). But I feel that
not allowing all code points in an XML document will *increase*
the overall problem for other systems. I think we want to reduce
the number of standards, not increase them, so encouraging people
to invent another standard rather than 'muddy' XML is, I think,
a mistake. I think there is going to be layer upon layer of
systems, protocols, information, etc., with XML at the core.
Having the core impose limits can cause problems for every layer
on top. And (I claim) the limits are not there because XML needs
limits, but because XML thinks it's doing the rest of the world a
service by limiting them.
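It is worth checking what the &#n; escape actually buys under the two sets of rules. Under XML 1.0, a control character like 0x14 is rejected both literally and as a character reference (the reference must still match the Char production); the 1.1 draft under discussion is precisely what would make the &#20; form legal. A quick check against a stdlib XML 1.0 parser, assuming expat's standard behavior:

```python
import xml.etree.ElementTree as ET

# Under XML 1.0 rules, both forms of a C0 control character fail:
# the raw byte and the &#20; character reference alike.
for doc in ["<s>\x14</s>", "<s>&#20;</s>"]:
    try:
        ET.fromstring(doc)
        print(repr(doc), "accepted")
    except ET.ParseError:
        print(repr(doc), "rejected")

# A tab, by contrast, is legal either way -- the "partial protection"
# point made earlier.
assert ET.fromstring("<s>\t</s>").text == "\t"
```

So the "use &#n; instead" advice only works once the 1.1 rules are adopted, which is exactly the change being argued over.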

And that's where it gets hard. Philosophically, is it better to stop
people from doing things that might be wrong, or better to allow people
to do more things and wear the responsibility if it was wrong?




Copyright 2001 XML.org. This site is hosted by OASIS