At pubsub.com, we read about 100K RSS feeds per day and build
"synthetic" feeds based on what we find. (i.e. you can, or soon will
be able to, ask for a custom feed to be generated that contains all
new references to "'Howard Dean' or 'Dr. Dean' or ..."). In the
process of reading all these feeds, we run across quite a bit of junk.
Some of it is not well-formed XML, but a lot of it is simply a failure
to comply with the alleged "specifications" for various versions of
RSS. The problem for us is that our service consists of passing on the
items that we find. So, should we as an intermediary be passing on
badly formed chunks of RSS (i.e. items) or should we be attempting to
clean them up?
If we pass on the bad stuff, we'll be accused by our clients of
creating badly formed RSS files. On the other hand, if we "clean up"
the stuff we find, we may find that the owners of the source feeds
object to our modifying what they published. Some may thank us for
fixing obvious problems. However, I'm nervous that one day one of our
"cleanup" routines will cause a semantic, not just syntactic, change
in the content... What should we do?
"pubDate" in rss gives a good example of the problem:
In RSS 2.0, a pubDate element is supposed to look something like
this:
<pubDate>Thu, 15 Jan 2004 12:59:06 -0500</pubDate>
However, we often see these elements arriving with clearly broken
content. For instance, we'll often see things like:
<pubDate>Thu, 15 January 2004 12:59:06 -0500</pubDate>
Should we consider the presence of "January" rather than "Jan" to
be an error? Or, should we silently clean it up and convert it to
"Jan"?
What should we do with the following?
<pubDate>Thu, 15 Janu 2004 12:59:06 -0500</pubDate>
Should we consider "Janu" to be an abbreviation of "January"? Or,
should we think it is "June"? Should our logic depend on time of year?
(i.e. if closer to June than January, do one thing, if not, do the
other?)
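
One possible cleanup rule for the month cases above, sketched in Python: treat a token as a month only if it is a prefix of exactly one English month name, and otherwise refuse to guess. The function name resolve_month and the three-letter minimum are assumptions made for this illustration, not anything the RSS or RFC822 specifications define.

```python
# Full English month names; the RFC822 form is the first three letters.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def resolve_month(token):
    """Return the three-letter month for token, or None if it is unclear.

    Hypothetical helper: a token is accepted only when it is at least
    three letters long and is a prefix of exactly one month name.
    """
    t = token.strip().lower()
    if len(t) < 3:
        return None  # "Ju" could be June or July, so do not guess
    matches = [m for m in MONTHS if m.lower().startswith(t)]
    if len(matches) == 1:
        return matches[0][:3]
    return None  # not a prefix of any month name

print(resolve_month("January"))  # Jan
print(resolve_month("Janu"))     # Jan
print(resolve_month("Ju"))       # None
```

Under this rule "Janu" can only be January, because it is not a prefix of "June", so no time-of-year heuristic is needed. Whether rewriting the value at all is acceptable for an intermediary is still the open question.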
What should we do with a date that appears in this format?
<pubDate>2004-01-11T14:04:00 -5:00</pubDate>
This is not an RFC822 date. However, it is fairly easy to figure
out that it is a date... Should we convert it to RFC822 format? Or,
pass it along as we found it?
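
If conversion were the chosen policy, it might look roughly like the Python sketch below. The pattern ISO_LIKE and the function iso_like_to_rfc822 are hypothetical names for this example, and the pattern covers only the single shape shown above (date, "T", time, optional space, signed hour:minute offset). Anything else is returned as None so the caller can pass the original through untouched.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Matches only the shape seen above, e.g. "2004-01-11T14:04:00 -5:00".
ISO_LIKE = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})"
    r"\s*([+-])(\d{1,2}):(\d{2})$"
)

def iso_like_to_rfc822(text):
    """Return an RFC822-style date string, or None if text does not match."""
    m = ISO_LIKE.match(text.strip())
    if m is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    sign = 1 if m.group(7) == "+" else -1
    offset = sign * timedelta(hours=int(m.group(8)), minutes=int(m.group(9)))
    try:
        dt = datetime(year, month, day, hour, minute, second,
                      tzinfo=timezone(offset))
    except ValueError:
        return None  # impossible date or offset: leave it alone
    return format_datetime(dt)

print(iso_like_to_rfc822("2004-01-11T14:04:00 -5:00"))
# Sun, 11 Jan 2004 14:04:00 -0500
```

Note that the conversion adds a day-of-week that the source never stated, which is a small example of a cleanup routine putting words in the publisher's mouth.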
What about dates like those that appear in the feed at
http://www.theblackrepublican.net/rss.xml ? They don't use the
optional pubDate field but do provide Dublin Core dates. However, they
encode them as follows:
<dc:date>2004-01-15T08:33:00+-5:00</dc:date>
Notice the "+-" (i.e. these folk are a bit conflicted about what
time zone they are in... They can't decide if they are ahead or
behind...) Should we pass this on as "+5:00" or "-5:00" or just leave
it to clients to figure out what is meant?
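
A middle path between guessing and passing the value on blindly would be to detect the conflict and list the possible readings without choosing one. A minimal Python sketch follows; the names CONFLICTED and offset_candidates are invented for this illustration, and the pattern is an assumption that only covers timestamps shaped like the one above.

```python
import re

# A timestamp whose UTC offset carries two or more sign characters.
CONFLICTED = re.compile(
    r"^(.*T\d{2}:\d{2}:\d{2})\s*([+-]{2,})(\d{1,2}:\d{2})$"
)

def offset_candidates(text):
    """Return every reading of text; more than one means it is ambiguous.

    Hypothetical helper: text that does not have a doubled sign is
    returned unchanged as the only candidate.
    """
    m = CONFLICTED.match(text.strip())
    if m is None:
        return [text]
    stamp, signs, offset = m.groups()
    # One candidate per distinct sign character that was present.
    return [stamp + s + offset for s in sorted(set(signs))]

print(offset_candidates("2004-01-15T08:33:00+-5:00"))
# ['2004-01-15T08:33:00+5:00', '2004-01-15T08:33:00-5:00']
```

The sketch deliberately stops at reporting the ambiguity. Picking "+5:00" or "-5:00" would need outside knowledge, such as where the publisher is located, that the feed itself does not provide.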
I would like to be "conservative" in what I generate, but the
problem is that as an intermediary, I'm being fed a lot of stuff that
was generated "liberally". So, I'm in a bind... One interpretation of
Postel's law would say that I should do my best to output proper RSS
V2.0 while being liberal about what I accept. However, another set of
rules (i.e. intermediaries should minimize how much they muck with
content passing through...) would force me to generate non-conforming
feeds. How do I solve this dilemma?
bob wyman