OASIS Mailing List Archives


Re: [xml-dev] xml over http - RFC 3023

Andrew Welch wrote:
> Hi Rick,
>> The out-of-band signalling of character encoding is a fundamentally broken
>> idea, because there are no mechanisms for programs which generate data to
>> memoize the character encoding used that can then feed the rest of the
>> food-chain.
> How about the BOM - that's one way isn't it?  I wonder if a similar
> ignorable byte sequence could be added to the start of all byte
> sequences to indicate the encoding of what's coming.
There is: it looks like this  <?xml version="1.0" encoding="...

This has the added advantage of being visible in text editors, unlike 
the BOM (usually).
>>> At the moment it all seems pretty complicated...
>> It is not complicated. Use application/xml
>> If you do find intermediate web systems that implement the ASCII default or
>> the ISO 8859-1 default as anything other than 8-bit clean for text/xml,
>> submit a bug report.
> So this is a real test of XML on the web.  The complicated part I was
> referring to is reading the bytes from the http input stream in the
> right encoding:
> - extract the encoding from the Content-Type
> - if it's not there, read the first few bytes of the stream in us-ascii and
> then extract the encoding from the prolog
> - if it's not there, use utf-8
> - hope that the actual encoding of the file and the encoding you've discovered match
> ...and that's not even completely correct as far as I understand.
For application/xml you ignore the first step and go straight to the 
document. If your data is usually in UTF-8 or ASCII, you could perhaps 
decode the first block from bytes to characters as UTF-8 and (if the 
transcoder has not generated an exception) confirm either that there is 
no XML encoding declaration or BOM, or that the XML encoding declaration 
says "utf-8", in which case you don't need to do anything more 
complicated. If your data is text/xml, you are indeed in a sea of 
complication, which is why text/xml has been discouraged for so long.
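
A minimal sketch of that shortcut in Python, for illustration only. The 
function name utf8_fast_path and the regular expression are my own 
assumptions, not taken from any library; the sketch also treats a UTF-8 
BOM as acceptable and rejects NUL characters so that BOM-less UTF-16 is 
not mistaken for UTF-8.

```python
import codecs
import re

# Hypothetical pattern: the encoding pseudo-attribute of an XML declaration.
_ENCODING_DECL = re.compile(
    r'^<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def utf8_fast_path(first_block: bytes) -> bool:
    """Return True if the block can be read as UTF-8 with no further sniffing."""
    # A UTF-16 BOM (also the first two bytes of a little-endian UTF-32 BOM)
    # rules the shortcut out.
    if first_block.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False
    # Assumption: a UTF-8 BOM is harmless here, so skip it.
    if first_block.startswith(codecs.BOM_UTF8):
        first_block = first_block[len(codecs.BOM_UTF8):]
    try:
        # The incremental decoder tolerates a character cut off at the block end.
        decoder = codecs.getincrementaldecoder("utf-8")()
        text = decoder.decode(first_block, final=False)
    except UnicodeDecodeError:
        return False
    # NUL is not a legal XML character; its presence suggests UTF-16/UTF-32.
    if "\x00" in text:
        return False
    match = _ENCODING_DECL.match(text)
    return match is None or match.group(1).lower() == "utf-8"
```

When this returns False, fall back to the full detection procedure.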

The detection method is specified in appendix F of the XML spec. I have 
implemented it a couple of times. Many other people have implemented it. 
There is lots of code floating about.
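
For illustration, here is a rough Python sketch of the kind of sniffing 
Appendix F describes for the case with no external encoding information: 
look for a BOM, then for the byte pattern of "<?xml" in the wide and 
EBCDIC families, then read the encoding declaration. The function name 
sniff_xml_encoding is hypothetical, the return values are Python codec 
names (an assumption of this sketch, chosen so a BOM is consumed when 
decoding), and only one EBCDIC code page (cp037) is tried.

```python
import re

# BOMs: four-byte forms must be tested before two-byte forms.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32"),
    (b"\xff\xfe\x00\x00", "utf-32"),
    (b"\xfe\xff", "utf-16"),
    (b"\xff\xfe", "utf-16"),
    (b"\xef\xbb\xbf", "utf-8-sig"),
]

# No BOM: the first bytes of "<?" reveal code-unit width and byte order.
_WIDE_SIGNATURES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

_EBCDIC_SIGNATURE = b"\x4c\x6f\xa7\x94"  # "<?xm" in EBCDIC

_DECL = re.compile(
    r'^<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes."""
    for bom, codec in _BOMS:
        if head.startswith(bom):
            return codec
    for signature, codec in _WIDE_SIGNATURES:
        if head.startswith(signature):
            return codec
    # ASCII-compatible or EBCDIC: decode just enough to read the declaration.
    family = "cp037" if head.startswith(_EBCDIC_SIGNATURE) else "utf-8"
    text = head[:1024].decode(family, errors="replace")
    match = _DECL.match(text)
    # No declaration: the family default applies (UTF-8 when nothing matched).
    return match.group(1).lower() if match else family
```

The name returned from a declaration is whatever the document says, so a 
caller still has to check that it maps to a codec it supports.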

> So when you say:
> "It is not complicated. Use application/xml"
> I don't get it, what am I missing?
> I would've thought the webserver would be aware that it was serving
> xml and take care of it - it could extract the encoding from the xml prolog
> and ensure the file was served with that (maintaining it however it
> liked)... it seems odd that the client should go through this process
> every time.
Maybe, but the mechanism for this to occur, for Apache at least, is for 
someone to write it, contribute it, champion it and maintain it.  The 
reason webservers typically don't, as I understand it, is that they are 
too busy to transcode: they need to transfer bits as fast as they can. 
This is the problem: typically the webserver is not set up to generate 
the header correctly, and in any case, how does the server know what the 
encoding of a particular file is?
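
To make the idea concrete, here is a hedged Python sketch of how a 
server-side hook might label a file from its own encoding declaration 
without transcoding anything. The function content_type_for is 
hypothetical and is not a feature of Apache or any other server that I 
know of; it only handles ASCII-compatible encodings and deliberately 
sends no charset parameter when it cannot find a declaration, leaving 
detection to the recipient.

```python
import re

# Hypothetical pattern: an optional UTF-8 BOM, then the XML declaration's
# encoding pseudo-attribute, matched directly on bytes (no transcoding).
_DECL = re.compile(
    rb'^(?:\xef\xbb\xbf)?<\?xml\s[^>]*?\bencoding\s*=\s*'
    rb'["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def content_type_for(head: bytes) -> str:
    """Build a Content-Type value from the first bytes of an XML file."""
    match = _DECL.match(head)
    if match:
        # Pass the document's own label through unchanged.
        return "application/xml; charset=" + match.group(1).decode("ascii")
    # Unknown: say nothing rather than guess.
    return "application/xml"
```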

But the basic XML contract is that the encoding must be explicitly 
labelled by the sender (creator of the document) and the recipient 
should not guess but use the label. If this is too much for naive users, 
then XML is simply not the technology for them, and XML should not be 
blamed for not working in a situation it explicitly was designed to 
avoid. It is just like if someone does not know what + means they cannot 
use a calculator. It is not an indictment of mathematics if someone who 
does not know + cannot use a calculator. Character encoding is just as 
fundamental to computer programming as knowledge of the difference 
between floats and ints, for example: that Western computer science and 
IT courses have guaranteed the ignorance of their students in this is sad.

In any case, I thought most people had written off RSS as unprocessable 
by generic XML tools, because so much RSS was not well-formed? I thought 
one reason for Atom was that the early RSS systems' creators messed up 
their XML and RSS never recovered.  With RSS, you are not 
experiencing the failure of XML on the web; you may be experiencing the 
failure of non-WF XML (and the potential complexity of figuring out 

Rick Jelliffe


Copyright 1993-2007 XML.org. This site is hosted by OASIS