OASIS Mailing List Archives

   Re: [xml-dev] rss regularis(z)ation


bryan wrote:

> Again the most serious RSS problem for me is escaped html, as such it
> indicates a necessary wrong in any technology, as you are always going
> to require a method for escaping characters reserved for your
> technology.

I found escaped HTML in RSS to be mostly an aesthetic problem
(in that it deeply offends my aesthetic sensibilities :-),
but not too hard to process.  Feed the element content into
a tag-soup parser, infer start- and end-tags to turn it into
a tree, and strip out all the elements you don't want showing up
in the aggregator output.  Took me about two hours to code this up
(to be fair, I did use an off-the-shelf lexer for the first step).
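
As a rough illustration of the approach just described (not the code
referred to above), here is a minimal sketch using Python's standard
html.parser as the off-the-shelf lexer.  The ALLOWED and
DROP_WITH_CONTENT sets and the sanitize() name are assumptions made up
for the example.  End tags are inferred only when an enclosing element
closes or the input ends, and no vetting of href values is attempted,
so treat it as a starting point rather than a security-grade filter.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: elements the aggregator is willing to show.
ALLOWED = {"p", "a", "em", "strong", "ul", "ol", "li", "br"}
VOID = {"br"}                            # elements that never get an end tag
DROP_WITH_CONTENT = {"script", "style"}  # drop these together with their text


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # output fragments
        self.open_tags = []  # stack of allowed elements currently open
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            # Keep only href on <a>; every other attribute is discarded.
            kept = [(k, v) for k, v in attrs if tag == "a" and k == "href" and v]
            text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
            self.parts.append(f"<{tag}{text}>")
            if tag not in VOID:
                self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags and not self.skip_depth:
            # Infer end tags for anything left open inside this element.
            while self.open_tags:
                top = self.open_tags.pop()
                self.parts.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))

    def close(self):
        super().close()
        # Close whatever the feed author forgot to close.
        while self.open_tags:
            self.parts.append(f"</{self.open_tags.pop()}>")


def sanitize(fragment):
    """Return fragment with unwanted elements stripped and open tags closed."""
    parser = Sanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)
```

The input here is the element content after the XML parser has already
undone the escaping, i.e. an ordinary HTML string.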

The biggest problems I've had with RSS have to do with
inconsistent usage.  For instance: some people put a
summary, abstract, or lead paragraph in the <description>
as was intended, others put the whole damn entry in there,
complete with fifteen paragraphs, two bulleted lists, and
six pictures of their cat.  Is <dc:creator> the author's
email address, full name, or a user ID?  I've seen half a
dozen different formats for dates (and many feeds don't even
include them, which is a real PITA since I want everything
sorted reverse-chronologically.)
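
One way to cope with the date variety, sketched under assumptions: try
the RFC 822 style first, then an ISO 8601 style, and give up politely
on anything else.  The function names are made up for the example, and
treating a date with no time zone as UTC is an arbitrary choice.  Only
these two families are handled; the other formats found in the wild
would need their own branches.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_feed_date(text):
    """Return an aware UTC datetime, or None if missing or unrecognized."""
    if not text or not text.strip():
        return None
    text = text.strip()
    parsed = None
    try:
        parsed = parsedate_to_datetime(text)  # RFC 822 style
    except (TypeError, ValueError):
        pass
    if parsed is None:
        try:
            # Older fromisoformat() versions reject a trailing "Z".
            iso = text[:-1] + "+00:00" if text.endswith("Z") else text
            parsed = datetime.fromisoformat(iso)  # ISO 8601 style
        except ValueError:
            return None
    if parsed.tzinfo is None:
        # Assumption: a date with no zone is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def newest_first(dates):
    """Sort reverse-chronologically; undated entries (None) go last."""
    floor = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(dates, key=lambda d: d or floor, reverse=True)
```

Normalizing everything to UTC matters for the sort: Python refuses to
compare a datetime that has a time zone with one that does not.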

Then there are the encoding problems.  The HTTP Content-Type header
says "text/plain" with no ";charset=" parameter (implying us-ascii),
the XML declaration says "utf-8", but the document is actually in
iso8859-1.  Variations on this theme abound.
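
A defensive decoder for that situation might look like the sketch
below.  It tries the encoding named in the XML declaration, then the
HTTP charset if one was given, then utf-8, and finally falls back to
iso-8859-1, which accepts any byte sequence.  The decode_feed() name,
the candidate order, and the fallback are all assumptions for the
example; byte-order marks and UTF-16 feeds are not handled.

```python
import re

# Matches an encoding pseudo-attribute in a leading XML declaration.
XML_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def decode_feed(raw, http_charset=None):
    """Return (text, encoding_used) for the raw bytes of a feed."""
    match = XML_DECL.match(raw)
    declared = match.group(1).decode("ascii") if match else None
    for name in (declared, http_charset, "utf-8"):
        if not name:
            continue
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown label, or the bytes do not fit it
    # Last resort: iso-8859-1 maps every byte to a character.
    return raw.decode("iso-8859-1"), "iso-8859-1"
```

For the case above (declared utf-8, actually iso8859-1) the first
attempt fails on the first accented character and the fallback takes
over.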

Those are minor annoyances that don't greatly affect functionality.
<link> vs. <guid> is another matter.  Some feeds put the URL
of the item itself in the <guid>, and the URL of the thing
the item is talking about in the <link>.  Others only use <guid> and
don't include a <link>.  Most, however, put the URL of the item
in the <link> and an opaque ID in the <guid>.

The distinction can, of course, be determined by the
"isPermaLink" attribute on <guid>; if it's "true", or omitted,
then the <guid> is the real link and the <link> is... well,
something else.  If it's "false", then the <link> is the
real link and the <guid> can be ignored.
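
The rule above can be sketched with Python's xml.etree.ElementTree.
The item_url() name is an assumption, the element is taken to be an
un-namespaced RSS 2.0 <item>, and the attribute is looked up with its
exact spelling only.

```python
import xml.etree.ElementTree as ET


def item_url(item):
    """Return the URL of an RSS <item> itself, or None if there is none."""
    guid = item.find("guid")
    link = item.find("link")
    if guid is not None and guid.text and guid.text.strip():
        # A missing isPermaLink attribute counts as "true".
        flag = guid.get("isPermaLink", "true").strip().lower()
        if flag != "false":
            return guid.text.strip()
    if link is not None and link.text and link.text.strip():
        return link.text.strip()
    return None
```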

However, there are differences of opinion on how to capitalize
the attribute.  Some spell it "isPermaLink" (Bactrian), others
spell it "isPermalink" (Dromedary).  Not a problem for
the DPH regexping his way through, but if you're using
a real XML processor, adapting to case-insensitivity
is a bit tedious.
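
With ElementTree, for instance, the tedium amounts to scanning the
attribute dictionary by hand.  A minimal sketch follows; the
get_attr_ci() helper is hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET


def get_attr_ci(element, name, default=None):
    """Look up an attribute by name, ignoring differences in case."""
    wanted = name.lower()
    for key, value in element.attrib.items():
        if key.lower() == wanted:
            return value
    return default


guid = ET.fromstring('<guid isPermalink="false">tag:example.org,2003:42</guid>')
print(get_attr_ci(guid, "isPermaLink", "true"))
```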

(For the record: according to Winer's spec the correct spelling
has two humps.)

Now as far as a headline browser is concerned, the item's URL
is arguably the most important bit of information about it.
It shouldn't take so much effort to locate it.


--Joe English

  jenglish@flightlab.com




 


Copyright 2001 XML.org. This site is hosted by OASIS