xml-dev - Re: [xml-dev] rss regularis(z)ation


To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] rss regularis(z)ation
From: Joe English <jenglish@flightlab.com>
Date: Tue, 22 Jul 2003 19:05:39 -0700
In-reply-to: <001701c35029$acc504a0$2001a8c0@bryans>
References: <001701c35029$acc504a0$2001a8c0@bryans>

bryan wrote:

> Again the most serious RSS problem for me is escaped html, as such it
> indicates a necessary wrong in any technology, as you are always going
> to require a method for escaping characters reserved for your
> technology.

I found escaped HTML in RSS to be mostly an aesthetic problem
(in that it deeply offends my aesthetic sensibilities :-),
but not too hard to process.  Feed the element content into
a tag-soup parser, infer start- and end-tags to turn it into
a tree, and strip out all the elements you don't want showing up
in the aggregator output.  Took me about two hours to code this up
(to be fair, I did use an off-the-shelf lexer for the first step).
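
A rough sketch of that pipeline in Python, with the standard library's
html.parser standing in for the off-the-shelf lexer (the original code is
not shown here, so this is a guess at the shape, not the implementation).
The ALLOWED whitelist, DROP_CONTENT set, and sanitize() name are all
invented for illustration.  The input is the <description> text after the
XML parser has already unescaped it.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: whatever the aggregator is willing to display.
ALLOWED = {"p", "a", "em", "strong", "ul", "ol", "li", "br"}
VOID = {"br"}                      # elements that never take an end tag
DROP_CONTENT = {"script", "style"} # drop these along with their text


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open = []   # stack of allowed elements still open
        self.skip = 0    # depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if tag not in ALLOWED:
            return                 # drop the tag, keep the text inside it
        href = dict(attrs).get("href") if tag == "a" else None
        attr = f' href="{escape(href)}"' if href else ""
        self.out.append(f"<{tag}{attr}>")
        if tag not in VOID:
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
            return
        if tag not in self.open:
            return                 # stray end tag: ignore it
        while self.open:           # infer any end tags that were omitted
            closed = self.open.pop()
            self.out.append(f"</{closed}>")
            if closed == tag:
                break

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))

    def close(self):
        super().close()            # flushes any buffered trailing text
        while self.open:           # close whatever the author left open
            self.out.append(f"</{self.open.pop()}>")


def sanitize(fragment):
    parser = Sanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

This only tidies markup for display.  It does not vet href values, so it
should not be mistaken for a security filter.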

The biggest problems I've had with RSS have to do with
inconsistent usage.  For instance: some people put a
summary, abstract, or lead paragraph in the <description>
as was intended, others put the whole damn entry in there,
complete with fifteen paragraphs, two bulleted lists, and
six pictures of their cat.  Is <dc:creator> the author's
email address, full name, or a user ID?  I've seen half a
dozen different formats for dates (and many feeds don't even
include them, which is a real PITA since I want everything
sorted reverse-chronologically).
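
One way to cope with the date mess, sketched in Python: try the RFC 822
style first, fall back to ISO 8601, and push undated items to the end.
The function names are made up, only two of the "half a dozen" formats
are covered, and treating zone-less dates as UTC is an assumption, not
something any spec promises.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_feed_date(text):
    """Return an aware UTC datetime, or None if the text is not understood."""
    text = (text or "").strip()
    if not text:
        return None
    try:
        # RFC 822 style, e.g. "Tue, 22 Jul 2003 19:05:39 -0700"
        dt = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        try:
            # ISO 8601 style, e.g. "2003-07-22T19:05:39-07:00"
            dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
        except ValueError:
            return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: no zone means UTC
    return dt.astimezone(timezone.utc)


def newest_first(items):
    """Sort (date_text, title) pairs newest first; undated items go last."""
    oldest = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(items,
                  key=lambda item: parse_feed_date(item[0]) or oldest,
                  reverse=True)
```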

Then there are the encoding problems.  The HTTP Content-Type header
says "text/plain" with no ";charset=" parameter (implying us-ascii),
the XML declaration says "utf-8", but it's actually in ISO-8859-1.
Variations on this theme abound.
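
A defensive decoding sketch in Python, assuming the raw bytes and
whatever encoding was declared are already in hand.  decode_feed() is a
hypothetical helper.  It tries strict UTF-8 first because ISO-8859-1 text
containing non-ASCII bytes almost never happens to be valid UTF-8,
whereas ISO-8859-1 will "successfully" decode anything.

```python
def decode_feed(raw, declared=None):
    """Decode feed bytes without trusting the declared encoding.

    Returns (text, encoding_used).  The candidate order is a heuristic,
    not a standard: strict UTF-8, then the declared label, then
    ISO-8859-1 as a last resort that cannot fail.
    """
    for enc in ("utf-8", declared, "iso-8859-1"):
        if not enc:
            continue
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown label: try the next one
```

For the case described above, decode_feed(body, "utf-8") on bytes that
are really ISO-8859-1 falls through to the last candidate instead of
raising.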

Those are minor annoyances that don't greatly affect functionality.
<link> vs. <guid> is another matter.  Some feeds put the URL
of the item itself in the <guid>, and the URL of the thing
the item is talking about in the <link>.  Others only use <guid> and
don't include a <link>.  Most, however, put the URL of the item
in the <link> and an opaque ID in the <guid>.

The distinction can, of course, be determined by the
"isPermaLink" attribute on <guid>; if it's "true", or omitted,
then the <guid> is the real link and the <link> is... well,
something else.  If it's "false", then the <link> is the
real link and the <guid> can be ignored.
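
Spelled out in Python with the standard ElementTree API, the rule looks
roughly like this.  item_url() is a made-up name, and the sketch assumes
un-namespaced RSS 2.0 elements and the exact "isPermaLink" spelling.

```python
import xml.etree.ElementTree as ET


def item_url(item):
    """Pick the item's own URL from an RSS <item> element."""
    guid = item.find("guid")
    if guid is not None and guid.text and guid.text.strip():
        # "true" or omitted: the <guid> is the permalink.
        flag = guid.get("isPermaLink", "true").strip().lower()
        if flag != "false":
            return guid.text.strip()
    # isPermaLink="false", or no usable <guid>: fall back to <link>.
    link = item.findtext("link")
    return link.strip() if link and link.strip() else None


item = ET.fromstring(
    '<item><link>http://example.org/a</link>'
    '<guid isPermaLink="false">tag:example.org,2003:1</guid></item>'
)
print(item_url(item))  # prints http://example.org/a
```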

However, there are differences of opinion on how to capitalize
the attribute.  Some spell it "isPermaLink" (Bactrian), others
spell it "isPermalink" (Dromedary).  Not a problem for
the DPH (Desperate Perl Hacker) regexping his way through,
but if you're using a real XML processor, adapting to
case-insensitivity is a bit tedious.
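
The tedium amounts to something like the following Python helper, which
scans an ElementTree element's attributes and compares names
case-insensitively.  get_attr_ci() is a hypothetical name, and this is
just one way to do it.

```python
import xml.etree.ElementTree as ET


def get_attr_ci(elem, name, default=None):
    """Look up an attribute by name, ignoring case differences."""
    wanted = name.lower()
    for key, value in elem.attrib.items():
        if key.lower() == wanted:
            return value
    return default


dromedary = ET.fromstring('<guid isPermalink="false">abc123</guid>')
print(get_attr_ci(dromedary, "isPermaLink"))  # prints false
```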

(For the record: according to Winer's spec the correct spelling
has two humps.)

Now as far as a headline browser is concerned, the item's URL
is arguably the most important bit of information about it.
It shouldn't take so much effort to locate it.

--Joe English

  jenglish@flightlab.com

Follow-Ups:
- Re: [xml-dev] rss regularis(z)ation
  - From: Elliotte Rusty Harold <elharo@metalab.unc.edu>

References:
- RE: [xml-dev] rss regularis(z)ation
  - From: "bryan" <bry@itnisk.com>
