   Re: [xml-dev] Postel's law, exceptions


Tim Bray wrote:
> On Jan 13, 2004, at 6:01 PM, Joe English wrote:
> > True, but the aggregator as a whole might still accept it --
> > possibly by noticing that the encoding is mislabelled and
> > munging the data into proper UTF-8 before passing it to the
> > parser.
>
> Wow, is there any software that actually does this?  I hadn't 
> encountered it.

My toy aggregator does, after a fashion.
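For illustration only, here is a minimal Python sketch of the kind of munging described above. It is not the toy aggregator's actual code. The name to_utf8 and the ISO-8859-1 fallback are assumptions: the idea is to ignore the label, accept the bytes as UTF-8 if they decode strictly, otherwise treat them as ISO-8859-1, and then rewrite the XML declaration to match. UTF-16 input and byte order marks are not handled.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document; group 1 is everything up to the opening quote.
_XML_DECL = re.compile(
    rb'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return the document as UTF-8 bytes, ignoring whatever label it carried.

    Heuristic: bytes that decode strictly as UTF-8 are assumed to be UTF-8;
    anything else is assumed to be in the fallback encoding.
    """
    try:
        text = raw.decode("utf-8")   # strict: raises on malformed sequences
    except UnicodeDecodeError:
        text = raw.decode(fallback)  # ISO-8859-1 maps every byte, so it cannot fail
    data = text.encode("utf-8")
    # Rewrite the declaration so it no longer contradicts the bytes.
    return _XML_DECL.sub(rb'\1"utf-8"', data, count=1)
```

A feed labelled UTF-8 but containing a bare 0xE9 byte comes out with that byte re-encoded as 0xC3 0xA9; a feed labelled ISO-8859-1 whose bytes are already valid UTF-8 only has its declaration changed.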


> > In fact any aggregator that doesn't do something like this
> > is doomed to fail -- *nobody's* feed has the encoding labelled
> > properly.  (Well, maybe not "nobody", but certainly not very many.)
>
> On the contrary; the vast majority of them are correct. -Tim

That's not been my experience.  In a small sample of
the ~50 feeds I'm subscribed to, I find:

    5   with Content-Type: text/xml, no charset parameter [*],
        XML declaration claims to be UTF-8;

    5   with Content-Type: text/xml, no charset parameter,
        XML declaration claims to be ISO-8859-1;

    3   with Content-Type: text/html (!); charset="iso-8859-1",
        XML declaration claims to be "UTF-8";

    1   with Content-Type: text/html; charset="utf-8";
        XML declaration at least agrees about the "utf-8" part

    1   RSS feed with content-type: "text/html", no XML declaration

    1   with Content-Type: text/plain, no charset parameter [*],
        XML declaration says UTF-8;

    1   with Content-type: text/plain, no charset parameter,
        XML declaration says ISO-8859-1

    2   with Content-type: text/plain, no charset parameter,
        XML declaration says nothing (these might actually
        be correct, but probably only by accident since they
        happen to contain only 7-bit characters).

    1   with Content-Type: httpd/unix-directory (?!?)


[*] Which means either "US-ASCII" or "ISO-8859-1", depending
on which RFC you take as authoritative.
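
To make the footnote concrete, here is a small Python sketch. The function name header_charset and its text_default parameter are hypothetical. The two readings are presumably RFC 3023 (text/xml without a charset parameter is US-ASCII) and RFC 2616 (text/* without one is ISO-8859-1); the caller picks which default to believe.

```python
from email.message import Message


def header_charset(content_type: str, text_default: str = "us-ascii"):
    """Return the charset implied by a Content-Type header value.

    text_default is the charset assumed for text/* types that carry no
    charset parameter: "us-ascii" under one reading, "iso-8859-1" under
    the other. Non-text types without a parameter yield None.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")  # surrounding quotes are stripped
    if charset is not None:
        return charset.lower()
    if msg.get_content_maintype() == "text":
        return text_default
    return None
```

With this, header_charset("text/xml") gives "us-ascii", and passing text_default="iso-8859-1" gives the other answer for the same header.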

In 4 cases, the HTTP header and XML declaration agree
on utf-8, in 2 they agree on ISO-8859-1, and the rest are
all "application/*".  Of the ones that agree, I can't
say for sure if they're accurate, since the feeds themselves
happen to contain only 7-bit data at the moment.
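
A check along these lines can be sketched in Python. The names declared_encoding and survey are hypothetical, and this is not the script used to produce the numbers above; it only compares the charset parameter with the XML declaration and notes whether the body is pure 7-bit, in which case agreement between the labels proves nothing about their accuracy.

```python
import re

# Captures the encoding name from an XML declaration at the start of the body.
_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(body: bytes):
    """Return the lowercased encoding named in the XML declaration, or None."""
    m = _DECL_ENCODING.match(body)
    return m.group(1).decode("ascii").lower() if m else None


def survey(charset_param, body: bytes) -> dict:
    """Compare the HTTP charset parameter (or None) with the XML declaration."""
    header = charset_param.lower() if charset_param else None
    declared = declared_encoding(body)
    return {
        "header": header,
        "declared": declared,
        "labels_agree": header is not None and header == declared,
        # True when every byte is below 0x80, so the labels are untested.
        "seven_bit": body.isascii(),
    }
```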

I'm surprised at the large number of feeds that don't
even get the *media type* right; I have serious doubts
about the accuracy of the charset parameter.


--Joe English

  jenglish@flightlab.com