bryan wrote:
> Again the most serious RSS problem for me is escaped html, as such it
> indicates a necessary wrong in any technology, as you are always going
> to require a method for escaping characters reserved for your
> technology.
I found escaped HTML in RSS to be mostly an aesthetic problem
(in that it deeply offends my aesthetic sensibilities :-),
but not too hard to process. Feed the element content into
a tag-soup parser, infer start- and end- tags to turn it into
a tree, and strip out all the elements you don't want showing up
in the aggregator output. Took me about two hours to code this up
(to be fair, I did use an off-the-shelf lexer for the first step).
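A rough Python sketch of that pipeline, for illustration only. It is not the code described above: it assumes the standard library's html.parser as the tag-soup lexer, and the whitelist and the names ALLOWED, Stripper and clean_description are hypothetical. The input is the <description> text after the XML parser has already undone the escaping. End-tag inference here is deliberately crude: an end tag closes everything opened since its matching start tag, and anything still open at the end is closed.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: elements the aggregator is willing to show.
ALLOWED = {"p", "a", "em", "strong", "ul", "ol", "li", "br"}
VOID = {"br"}                      # elements that never get an end tag
DROP_CONTENT = {"script", "style"}  # drop these together with their text


class Stripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of cleaned output
        self.open_tags = []  # stack of allowed elements still open
        self.skip = 0        # depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if tag not in ALLOWED:
            return  # unwanted element: drop the tag, keep its text
        # Keep only href on <a>; every other attribute is discarded.
        kept = [(k, v) for k, v in attrs if tag == "a" and k == "href" and v]
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip:
                self.skip -= 1
            return
        if tag not in self.open_tags:
            return  # stray or unwanted end tag
        # Infer missing end tags: close everything down to the match.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def clean_description(text):
    parser = Stripper()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

For example, clean_description('<p>Hello <b>world</b>') drops the <b> tags, keeps their text, and supplies the missing </p>.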
The biggest problems I've had with RSS have to do with
inconsistent usage. For instance: some people put a
summary, abstract, or lead paragraph in the <description>
as was intended, others put the whole damn entry in there,
complete with fifteen paragraphs, two bulleted lists, and
six pictures of their cat. Is <dc:creator> the author's
email address, full name, or a user ID? I've seen half a
dozen different formats for dates (and many feeds don't even
include them, which is a real PITA since I want everything
sorted reverse-chronologically).
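One way to cope with the date mess, sketched in Python under some assumptions: only RFC 822 style dates (as in <pubDate>) and ISO 8601 style dates (as in <dc:date>) are recognized, dates without a time zone are treated as UTC, and undated items sort last. The names parse_feed_date and newest_first, and the "date" key on each item, are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_feed_date(text):
    """Return an aware UTC datetime, or None if missing or unrecognized."""
    if not text or not text.strip():
        return None
    text = text.strip()
    try:
        # RFC 822 style, e.g. "Tue, 24 Jun 2003 12:00:00 GMT"
        dt = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        try:
            # ISO 8601 style, e.g. "2003-06-24T14:00:00+02:00"
            dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
        except ValueError:
            return None
    if dt.tzinfo is None:
        # Assumption: a date with no zone is taken to be UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def newest_first(items):
    """Sort item dicts reverse-chronologically; undated items go last."""
    def key(item):
        d = parse_feed_date(item.get("date"))
        return (d is not None, d or EPOCH)
    return sorted(items, key=key, reverse=True)
```

Any format beyond these two would need its own branch; the sketch simply gives up and returns None.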
Then there are the encoding problems. The HTTP Content-Type header
says "text/plain" with no ";charset=" parameter (implying us-ascii),
the XML declaration says "utf-8", but the feed is actually in ISO-8859-1.
Variations on this theme abound.
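A defensive decoding step along these lines can paper over most of it. This is only a sketch: the function name decode_feed and the ordering of the guesses are assumptions, not anything the specs prescribe. It tries strict UTF-8 first (text in another encoding rarely happens to be valid UTF-8 once it contains non-ASCII bytes), then whatever encoding was declared, and finally falls back to ISO-8859-1, which accepts any byte sequence.

```python
def decode_feed(raw, declared=None):
    """Decode feed bytes, distrusting the declared encoding.

    Returns (text, encoding_used). `declared` is whatever the HTTP
    header or XML declaration claimed, if anything.
    """
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python does not know.
            continue
    # Last resort: every byte maps to a character in ISO-8859-1.
    return raw.decode("iso-8859-1"), "iso-8859-1"
```

The fallback never fails, so a feed in some other 8-bit encoding will come out as mojibake rather than raise an error.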
Those are minor annoyances that don't greatly affect functionality.
<link> vs. <guid> is another matter. Some feeds put the URL
of the item itself in the <guid>, and the URL of the thing
the item is talking about in the <link>. Others only use <guid> and
don't include a <link>. Most, however, put the URL of the item
in the <link> and an opaque ID in the <guid>.
The distinction can, of course, be determined by the
"isPermaLink" attribute on <guid>; if it's "true", or omitted,
then the <guid> is the real link and the <link> is... well,
something else. If it's "false", then the <link> is the
real link and the <guid> can be ignored.
However, there are differences of opinion on how to capitalize
the attribute. Some spell it "isPermaLink" (Bactrian), others
spell it "isPermalink" (Dromedary). Not a problem for
the DPH regexping his way through, but if you're using
a real XML processor, adapting to the inconsistent capitalization
is a bit tedious.
(For the record: according to Winer's spec the correct spelling
has two humps.)
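With a real XML processor the rule above, humps and all, might be sketched like this in Python. It assumes un-namespaced RSS 2.0 elements parsed with xml.etree.ElementTree; item_url is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def item_url(item):
    """Return the item's own URL, or None if none can be found."""
    guid = item.find("guid")
    link = item.find("link")
    if guid is not None and guid.text and guid.text.strip():
        # XML attribute names are case-sensitive, so compare by hand
        # to accept both "isPermaLink" and "isPermalink".
        flag = "true"  # an omitted attribute counts as "true"
        for name, value in guid.attrib.items():
            if name.lower() == "ispermalink":
                flag = value.strip().lower()
        if flag == "true":
            return guid.text.strip()
    # isPermaLink is "false", or there is no usable <guid>.
    if link is not None and link.text and link.text.strip():
        return link.text.strip()
    return None
```

A feed that puts an opaque ID in <guid> and leaves the attribute off would defeat this; checking that the text looks like a URL is one possible guard, not shown here.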
Now as far as a headline browser is concerned, the item's URL
is arguably the most important bit of information about it.
It shouldn't take so much effort to locate it.
--Joe English
jenglish@flightlab.com