Re: [xml-dev] xml over http - RFC 3023
- From: Rick Jelliffe <rjelliffe@allette.com.au>
- To: Andrew Welch <andrew.j.welch@gmail.com>
- Date: Mon, 01 Dec 2008 22:02:25 +1100
Andrew Welch wrote:
> Hi Rick,
>
>
>> The out-of-band signalling of character encoding is a fundamentally broken
>> idea, because there are no mechanisms for programs which generate data to
>> memoize the character encoding used that can then feed the rest of the
>> food-chain.
>>
>
> How about the BOM - that's one way isn't it? I wonder if a similar
> ignorable byte sequence could be added to the start of all byte
> sequences to indicate the encoding of what's coming.
>
There is: it looks like this <?xml version="1.0" encoding="...
This has the added advantage of being visible in text editors, unlike
the BOM (usually).
>>> At the moment it all seems pretty complicated...
>>>
>
>
>> It is not complicated. Use application/xml
>>
>> If you do find intermediate web systems that implement the ASCII default or
>> the ISO 8859-1 default as anything other than 8-bit clean for text/xml, submit
>> a bug report.
>>
>
> So this is a real test of XML on the web. The complicated part I was
> referring to is reading the bytes from the http input stream in the
> right encoding:
>
> - extract the encoding from the Content-Type
> - if it's not there, read the first few bytes of the stream in us-ascii and
> then extract the encoding from the prolog
> - if it's not there, use utf-8
> - hope that the actual encoding of the file and the encoding you've discovered match
>
> ...and that's not even completely correct as far as I understand.
>
For application/xml you skip the first step and go straight to the
document. If your data is usually in UTF-8 or ASCII, you could perhaps
decode the first block of bytes as UTF-8 and (if the transcoder has not
raised an exception) confirm that there is no BOM or XML encoding
declaration, or that the encoding declaration names "utf-8", in which
case you don't need to do anything more complicated. If your data is
text/xml, you are indeed in a sea of complication, which is why text/xml
has been discouraged for so long.
The detection method is described in the (non-normative) Appendix F of
the XML spec. I have implemented it a couple of times. Many other people
have implemented it. There is lots of code floating about.
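A minimal sketch of that kind of detection for application/xml, in
Python, assuming the first block of the entity is already in memory. The
names sniff_xml_encoding and decode_xml are made up for illustration,
EBCDIC is left out, and it follows Appendix F only loosely, so treat it
as an illustration rather than a conforming implementation:

```python
import codecs
import re

# BOM signatures, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
# The generic "utf-16"/"utf-32" codecs consume the BOM themselves.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# No BOM: recognise how "<" or "<?" looks in the wide encodings.
_NO_BOM = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

# Encoding declaration as it appears in any ASCII-compatible encoding.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess a Python codec name from the first bytes of an XML entity."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    for prefix, name in _NO_BOM:
        if head.startswith(prefix):
            return name
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: the XML default is UTF-8.
    return "utf-8"


def decode_xml(data: bytes) -> str:
    """Decode a whole entity; raises if the label is unknown or wrong."""
    return data.decode(sniff_xml_encoding(data[:1024]))
```

Note that nothing here looks at the HTTP headers at all, which is the
point of application/xml: the document labels itself.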
> So when you say:
>
> "It is not complicated. Use application/xml"
>
> I don't get it, what am I missing?
>
> I would've thought the webserver would be aware that it was serving
> xml and take care of it - it could extract the encoding from the xml prolog
> and ensure the file was served with that (maintaining it however it
> liked)... it seems odd that the client should go through this process
> every time.
Maybe, but the mechanism for this to occur, for Apache at least, is for
someone to write it, contribute it, champion it and maintain it. The
reason webservers typically don't, as I understand it, is that they are
too busy to transcode: they need to transfer bits as fast as they can.
This is the problem: typically the webserver is not set up to generate
the header correctly, and in any case, how does the server know what the
encoding of a particular file is?
But the basic XML contract is that the encoding must be explicitly
labelled by the sender (creator of the document) and the recipient
should not guess but use the label. If this is too much for naive users,
then XML is simply not the technology for them, and XML should not be
blamed for not working in a situation it explicitly was designed to
avoid. It is just like if someone does not know what + means they cannot
use a calculator. It is not an indictment of mathematics if someone who
does not know + cannot use a calculator. Character encoding is just as
fundamental to computer programming as knowledge of the difference
between floats and ints, for example: that Western computer science and
IT courses have guaranteed the ignorance of their students in this is sad.
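To make the contract concrete, here is a small sketch using Python's
standard xml.etree.ElementTree (my choice of library for illustration,
not something from this thread). The sender writes the label into the
document, and the recipient parses the raw bytes and lets the parser
honour that label instead of guessing:

```python
import io
import xml.etree.ElementTree as ET

# Sender: serialise with an explicit encoding declaration.
root = ET.Element("greeting")
root.text = "caf\u00e9"
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="iso-8859-1", xml_declaration=True)
data = buf.getvalue()  # bytes that begin with the <?xml ... ?> label

# Recipient: hand over the bytes untouched; the parser reads the label.
parsed = ET.fromstring(data)
```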
In any case, I thought most people had written off RSS as unprocessable
by generic XML tools, because so much RSS was not well-formed? I thought
one reason for Atom was that the early RSS system creators messed up
their XML and RSS never recovered. With RSS, you are not experiencing
the failure of XML on the web; you may be experiencing the failure of
non-WF XML (and the potential complexity of figuring out text/xml).
Cheers
Rick Jelliffe