Re: [xml-dev] Postel's law, exceptions
On Jan 13, 2004, at 7:26 PM, Julian Reschke wrote:
>> It was mis-specified (actually it wasn't specified at all, and as it wasn't UTF-8 it should have been).
> OK, so the insistence on the encoding declaration being correct is the "draconian" bit here. Thanks.
> I'll not comment on the rest because it seems to say that because of recent advances, we don't need a well-defined markup syntax. Somehow I doubt this is true :-)
Not my argument. I'm saying that well-defined markup syntax is basically for machine-to-machine communication (although obviously the übergeeks on this list can hand-author it), so machines are going to be doing the work to produce it from human-authored slop. One still needs good markup specs to define the template of the stuff that the machine creates, and to allow the de-soupification to be done only once in a processing pipeline.
The debate about Postel seems a bit pointless, since ordinary humans are never going to be trained to be conservative in what they produce, and they will insist on being liberal in what they consume. The only alternative to despair seems to be to automate the drudgery, ideally in the authoring tool, but more realistically in a downstream filter. (Actually, the RSS/Atom debate about this seems to be over whose job it is to de-soupify, the syndicator or the aggregator.)
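To make the "downstream filter" idea concrete, here is a rough sketch in Python. Everything in it (the Desoupifier class, the short list of void elements, the sample soup) is made up for illustration; it is nowhere near what tidy or a real aggregator does. It just escapes character data and closes whatever the author left open, so the soup comes out at least well-formed, if not necessarily valid.

from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

VOID = {"br", "hr", "img", "meta", "link", "input"}  # elements that never hold content

class Desoupifier(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []       # pieces of the re-emitted, well-formed markup
        self.stack = []     # currently open (non-void) elements

    def handle_starttag(self, tag, attrs):
        rendered = "".join(f" {name}={quoteattr(value or '')}" for name, value in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close any elements the author left open inside this one, too.
            while self.stack:
                open_tag = self.stack.pop()
                self.out.append(f"</{open_tag}>")
                if open_tag == tag:
                    break
        # A stray end tag with no matching start tag is simply dropped.

    def handle_data(self, data):
        self.out.append(escape(data))   # re-escape &, <, > in character data

    def result(self):
        while self.stack:               # close whatever is still open at end of input
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)

soup = "<p>one para<p>two & <b>bold <i>both</p>"
d = Desoupifier()
d.feed(soup)
d.close()
print(d.result())

The point is not the code but the division of labor: the human types slop once, and one filter like this (or a much smarter one) pays the well-formedness tax for everyone downstream.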
There will be cases where one must insist that no dumb machine "fix" the inputs, such as Tim Bray's example of the ill-formed stock transaction message. I suspect there will be thousands of times more cases, however, where it's more like the mismatch between the encoding declaration and the character set in Sam Ruby's example, and machines can be trusted to do the right thing.
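For the encoding case, "the right thing" is not even hard to sketch. The following is only an illustration of the kind of repair I mean (the function name and the list of fallback encodings are my own invention, not anybody's shipping code): look at what the XML declaration claims, see whether the bytes actually decode that way, fall back to the usual suspects if not, and re-serialize with a declaration that tells the truth.

import re

def fix_encoding(raw: bytes) -> bytes:
    # What does the XML declaration claim? (If nothing, XML says UTF-8.)
    decl = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([^"\']+)["\']', raw)
    declared = decl.group(1).decode("ascii") if decl else "utf-8"

    # Try the declared encoding first, then the usual suspects for web slop.
    for candidate in (declared, "utf-8", "windows-1252"):
        try:
            text = raw.decode(candidate)
            break
        except (UnicodeDecodeError, LookupError):
            continue
    else:
        text = raw.decode("latin-1")    # latin-1 always "succeeds"; last resort

    # Drop the old declaration and emit one that tells the truth.
    text = re.sub(r'^<\?xml[^>]*\?>\s*', '', text)
    return ('<?xml version="1.0" encoding="utf-8"?>\n' + text).encode("utf-8")

# A document that declares nothing but is really windows-1252:
broken = '<?xml version="1.0"?><title>caf\u00e9</title>'.encode("windows-1252")
print(fix_encoding(broken).decode("utf-8"))

A filter like that would have quietly handled Sam Ruby's example, while Tim Bray's stock transaction would remain the kind of message you refuse to guess about.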
A year ago, I probably would have disagreed, but I've seen how an utterly stupid statistical tool (SpamBayes) has liberated me from spam with a grand total of 1 known false positive (and that was a legitimate message that sounded exactly like a spam, something like "the information you requested is at such-and-such a URL") out of tens of thousands of spams. Dave Raggett's tidy is another example of a fairly dumb program fixing a lot of tag soup with minimal damage to actual content structure. For that matter, Google and the next-generation stuff such as Vivisimo do an awfully good job of making "judgements" from tag soup, using inferred metadata rather than hand-authored metadata. I don't think it requires strong AI to make a really good guess at markup, especially in highly regular content such as weblogs and news feeds.
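For the record, that is not SpamBayes' actual algorithm (it combines token probabilities far more carefully), but the flavor of "utterly stupid statistics" is easy to show. Everything below (the ToyFilter class, the two training snippets) is invented for illustration.

import math
from collections import Counter

def tokens(text):
    return text.lower().split()

class ToyFilter:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, label, text):
        for tok in tokens(text):
            self.counts[label][tok] += 1
            self.totals[label] += 1

    def score(self, text):
        # Sum of per-token log-likelihood ratios; positive smells like spam.
        s = 0.0
        for tok in tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.totals["ham"] + 2)
            s += math.log(p_spam / p_ham)
        return s

f = ToyFilter()
f.train("spam", "the information you requested is at such-and-such a URL")
f.train("ham", "the encoding declaration did not match the actual character set")
print(f.score("information you requested"))     # > 0: looks spammy
print(f.score("encoding declaration mismatch")) # <= 0: looks legitimate

If a scorer that dumb can sort mail usefully after a little training, guessing the intended structure of a weblog entry or a news feed does not seem to need strong AI either.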