Sometimes, it is hard to follow the law...
As it turns out, even though I've been arguing for insisting
that people generate valid data, my site (http://weblogs.pubsub.com/)
has been accused of generating invalid RSS files. If we don't "fix"
this, we're going to get put on the Syndic8 list of feeds needing
"repair"... I'd appreciate some guidance on how to fix this problem.
The answer isn't intuitively obvious.
What we do at PubSub.com is generate custom, synthetic RSS
feeds. We scan about 100K feeds continuously and let people
"subscribe" to items in those feeds. (Thus, if you want to know every
time "(RSS OR ATOM) AND (BLOG OR FEED)" is mentioned in an RSS feed,
we can help you.) When we find a match, we insert it into a custom RSS
file being maintained for the subscriber. (In the future, we'll
support other kinds of "delivery": email, SOAP, XML-RPC, etc.)
The issue with our feeds is that we don't put <language> tags
in them. These tags are defined as optional in RSS V2.0, but there is
no question that having them improves the utility of a feed
significantly, and some people consider their absence to constitute a
"broken feed." Our dilemma is that RSS appears to have been defined
with the assumption that all items in a feed would share a common
language. This is a good assumption when RSS is being used to
syndicate the content of a blog being maintained by a single person.
However, it doesn't work well when the feed is composed of items
sourced from thousands of other feeds. What we need is a <language>
tag on items -- not a single tag for the whole RSS file.
Unfortunately, RSS V2.0 doesn't define item-level <language> tags...
Now, clearly, we could define some new namespace and create an
item-level <language> tag of our own like "<ps:language>". The
difficulty with doing so is that this private tag wouldn't achieve
much more than wasting bandwidth since no known news aggregator knows
what to do with it. This is the case, of course, with many
"extensions" to XML formats... They work within small groups, but are
simply noise when the scope of usage expands since no one supports
them.
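
To make the idea concrete, here is a rough sketch of what emitting a
private item-level tag might look like. The namespace URI, the "ps"
prefix binding, and the build_feed helper are all made up for
illustration; nothing here is a published schema or our actual
generator.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the private "ps" extension (an assumption,
# not a real published schema).
PS_NS = "http://example.com/ns/pubsub-ext"
ET.register_namespace("ps", PS_NS)


def build_feed(items):
    """Build an RSS 2.0 document whose items each carry <ps:language>.

    `items` is a list of (title, language_code) pairs.
    No channel-level <language> element is written.
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example synthetic feed"
    ET.SubElement(channel, "link").text = "http://example.com/"
    ET.SubElement(channel, "description").text = "Items matched from many source feeds"
    for title, lang in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Item-level language in the private namespace.
        ET.SubElement(item, f"{{{PS_NS}}}language").text = lang
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed([("Hello", "en"), ("Bonjour", "fr")])
print(feed_xml)
```

An aggregator that doesn't recognize the namespace will simply skip
the element, which is exactly the "wasted bandwidth" concern.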
It has been suggested that we should do a scan of the
generated feed and determine what language is most commonly used in
the various items that have been collected. However, I don't think
this gets us to any place useful. The problem is that while this might
mean that the channel-level language tag is right for many items, it
will still be wrong for many other items. Also, this means that the
<language> for one of our RSS channels could be changing from minute
to minute as content of one language or another ebbs and flows into
the generated feed.
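
A small sketch of the "most common language" approach shows both
problems. The majority_language helper and the sample data are
invented for illustration, and it assumes each collected item already
has a known language code.

```python
from collections import Counter


def majority_language(item_languages):
    """Return (most common language, number of items that label gets wrong)."""
    counts = Counter(item_languages)
    language, hits = counts.most_common(1)[0]
    return language, len(item_languages) - hits


# Made-up snapshots of the same generated feed a few minutes apart.
snapshot_1 = ["en", "en", "fr", "de", "en"]
# The two oldest items expire and two French items arrive.
snapshot_2 = snapshot_1[2:] + ["fr", "fr"]

print(majority_language(snapshot_1))
print(majority_language(snapshot_2))
```

In this sample the channel-level tag flips from "en" to "fr" between
snapshots, and in both cases two of the five items are mislabeled.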
Our interface allows people to create subscriptions that
restrict the content that is scanned for them to only those items that
are marked as being in some specific language. Potentially, we could
insert <language> tags into such single-language feeds, but we are
then still left with the issue of what we should do for subscriptions
that specify "any language" as the content source...
One approach to solving this problem would be to simply use a
newly defined language code that indicates ambiguity of language.
Thus, I might use "x-mixed" or "x-unknown". (Until "i-mixed" and
"i-unknown" are registered with IANA to join the "i-default" which is
already registered.) RSS V2.0 defines its language codes via W3C as
compliant with RFC1766, which provides for new language tags to be
defined. We would use one of these tags on feeds which were not
language specific. But, is this the right thing to do? It solves my
problem of needing to have a language tag and of needing to be
explicit about what I'm transmitting. However, it will probably be
some time before news aggregators actually know what to do with such a
tag... Also, registering these tags with IANA could result in other
people using them with potentially negative impacts in other XML
files, etc...
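
As a sketch of how the channel-level tag might be chosen under this
scheme: single-language subscriptions get their own code, and
everything else gets the private-use tag. The channel_language helper
and the way a subscription's languages are represented are
assumptions for illustration. The pattern follows the RFC1766 grammar
(a primary tag and subtags of one to eight letters each, joined by
hyphens).

```python
import re

# Private-use tag proposed above; RFC1766 reserves the primary tag "x"
# for private use.
MIXED_TAG = "x-mixed"

# RFC1766 syntax: Primary-tag *( "-" Subtag ), each part 1 to 8 letters.
RFC1766_TAG = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z]{1,8})*$")


def channel_language(subscription_languages):
    """Pick the channel-level <language> value for a subscription.

    `subscription_languages` is the set of language codes the subscriber
    restricted the scan to; an empty set stands for "any language".
    """
    if len(subscription_languages) == 1:
        tag = next(iter(subscription_languages))
    else:
        tag = MIXED_TAG
    if not RFC1766_TAG.match(tag):
        raise ValueError(f"not a syntactically valid RFC1766 tag: {tag!r}")
    return tag


print(channel_language({"fr"}))
print(channel_language(set()))
print(channel_language({"en", "de"}))
```

This only checks syntax; it says nothing about whether an aggregator
will understand the tag.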
But, perhaps there is some obvious solution that I haven't
considered...
Please consider offering some guidance on this issue and have
a look at our site at http://weblogs.pubsub.com/ . How do I keep my
feeds off the "broken feeds" list?
bob wyman