Tim Bray wrote:
> On Jan 13, 2004, at 6:01 PM, Joe English wrote:
> > True, but the aggregator as a whole might still accept it --
> > possibly by noticing that the encoding is mislabelled and
> > munging the data into proper UTF-8 before passing it to the
> > parser.
>
> Wow, is there any software that actually does this? I hadn't
> encountered it.
My toy aggregator does, after a fashion.
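For illustration only, here is a minimal sketch of the kind of munging described above. It is not the code from the aggregator mentioned here; the function name `to_utf8` and the fallback order are assumptions. The idea is that bytes containing high-bit characters rarely validate as UTF-8 by accident, so strict UTF-8 is tried first, then the declared label, then ISO-8859-1 as a last resort (which accepts any byte sequence).

```python
from typing import Optional


def to_utf8(raw: bytes, declared: Optional[str] = None) -> bytes:
    """Return `raw` re-encoded as UTF-8, guessing the real source encoding.

    `declared` is whatever label the feed carried (HTTP charset or XML
    declaration); it is treated as a hint, not as the truth.
    """
    # Strict UTF-8 first: a successful decode is strong evidence the data
    # really is UTF-8 (or plain ASCII), whatever the label says.
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)

    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong label (UnicodeDecodeError) or unknown label (LookupError).
            continue

    # Last resort: ISO-8859-1 maps every byte to a character, so this
    # cannot fail, although the result may be wrong for other encodings.
    return raw.decode("iso-8859-1").encode("utf-8")
```

Note that the XML declaration inside the returned bytes still carries the old label, so a caller would also need to tell the parser to treat the input as UTF-8.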
> > In fact any aggregator that doesn't do something like this
> > is doomed to fail -- *nobody's* feed has the encoding labelled
> > properly. (Well, maybe not "nobody", but certainly not very many.)
>
> On the contrary; the vast majority of them are correct. -Tim
That's not been my experience. In a small sample of
the ~50 feeds I'm subscribed to, I find:
5 with Content-Type: text/xml, no charset parameter [*],
  XML declaration claims to be UTF-8;
5 with Content-Type: text/xml, no charset parameter,
  XML declaration claims to be ISO-8859-1;
3 with Content-Type: text/html (!); charset="iso-8859-1",
  XML declaration claims to be "UTF-8";
1 with Content-Type: text/html; charset="utf-8",
  XML declaration at least agrees about the "utf-8" part;
1 RSS feed with Content-Type: text/html, no XML declaration;
1 with Content-Type: text/plain, no charset parameter [*],
  XML declaration says UTF-8;
1 with Content-Type: text/plain, no charset parameter,
  XML declaration says ISO-8859-1;
2 with Content-Type: text/plain, no charset parameter,
  XML declaration says nothing (these might actually
  be correct, but probably only by accident since they
  happen to contain only 7-bit characters);
1 with Content-Type: httpd/unix-directory (?!?)
[*] Which means either "US-ASCII" or "ISO-8859-1", depending
on which RFC you take as authoritative.
In 4 cases, the HTTP header and XML declaration agree
on utf-8, in 2 they agree on ISO-8859-1, and the rest are
all "application/*". Of the ones that agree, I can't
say for sure if they're accurate, since the feed itself
happens to contain only 7-bit data at the moment.
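A tally like the one above can be produced mechanically. The following is a rough sketch, not the script used for this survey; the helper names are made up for illustration, and it assumes the feed body is in an ASCII-compatible encoding (so UTF-16 feeds are not handled). It pulls the charset parameter out of the Content-Type header, pulls the encoding out of the XML declaration, and reports whether the two labels agree.

```python
import re

# Matches an XML declaration at the very start of the body and captures
# the value of its encoding pseudo-attribute.
XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def parse_content_type(value):
    """Split a Content-Type header value into (media_type, charset_or_None)."""
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    charset = None
    for param in parts[1:]:
        name, _, val = param.partition("=")
        if name.strip().lower() == "charset":
            charset = val.strip().strip("\"'").lower() or None
    return media_type, charset


def declared_xml_encoding(body):
    """Return the lowercased encoding named in the XML declaration, or None."""
    m = XML_DECL.match(body)
    return m.group(1).decode("ascii").lower() if m else None


def labels_agree(content_type, body):
    """True only if both labels are present and name the same encoding."""
    _, http_charset = parse_content_type(content_type)
    xml_charset = declared_xml_encoding(body)
    return http_charset is not None and http_charset == xml_charset
```

Agreement between the two labels says nothing about whether either is accurate; that still requires looking at bytes outside the 7-bit range.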
I'm surprised at the large number of feeds that don't
even get the *media type* right; I have serious doubts
about the accuracy of the charset parameter.
--Joe English
jenglish@flightlab.com