Mike Champion wrote:
>
> ...
>
> I may well be over-optimistic; I'm trying to put together some code
> to explore the issue. For what it's worth, my suspicion that there
> *is* a lot one could do with fairly simple heuristics was strengthened
> by reading http://www.paulgraham.com/spam.html (a discussion of
> spam filtering):
There is a big difference between depending on context and depending on
context *heuristically*. Every programming language uses context. Very
few (!) use heuristics.
> " A few simple rules will take a big bite out of your incoming spam.
> Merely looking for the word "click" will catch 79.7% of the emails in
> my spam corpus, with only 1.2% false positives."
That's fine because the price of a wrongly classified email slipping
through is so low. That is rarely the case in other computer science
applications.
> Also check out Eugene Kuznetzov's article in XML Journal on
> XML-aware network equipment http://www.sys-con.com/xml/articleprint.cfm?id=459
> In discussing the challenge of recognizing a specific XML
> vocabulary and routing messages in that vocabulary to a specialized
> processor, he says "the same device could send messages in a particular
> XML vocabulary to the server capable of processing them, or it could
> send separate XML-RPC and SOAP messages. The routing rules are specified
> using either proprietary pattern-matching languages or a limited subset of XPath."
But there is nothing heuristic about using XPath! XPaths are precise
matching expressions.
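
To make the distinction concrete, here is a minimal sketch in Python of
routing by exact matching rather than by guessing. It uses the limited
XPath subset in the standard library's xml.etree.ElementTree. The function
name route() and the labels it returns are made up for illustration; a real
router would presumably hand the message to an actual server.

```python
import xml.etree.ElementTree as ET

# Namespace of the SOAP 1.1 envelope.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def route(message: str) -> str:
    """Return a (hypothetical) destination label for an XML message.

    Each rule is an exact test on the root element plus one path
    expression. A message either matches a rule or it does not.
    """
    root = ET.fromstring(message)  # raises ET.ParseError if not well-formed

    # XML-RPC: <methodCall> root with a <methodName> child.
    if root.tag == "methodCall" and root.find("./methodName") is not None:
        return "xml-rpc"

    # SOAP: namespaced <Envelope> root with a <Body> child.
    if (root.tag == f"{{{SOAP_NS}}}Envelope"
            and root.find("./soap:Body", {"soap": SOAP_NS}) is not None):
        return "soap"

    return "unknown"
```

Nothing here is scored or weighted: a message that matches no rule is
reported as "unknown" rather than assigned to the most likely vocabulary.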
>...
> Also, I really hate to mention this :-) but think of the "wonderful" job
> that browsers do in making sense out of hideously invalid HTML.
Once again, the cost of getting things wrong is low.
> ... Is there
> any reason to think that that level of creative hackery can't or won't
> be applied to the challenge of making sense out of business messages
> in XML, some of which will come from buggy software, some of which will be
> human edited, some of which will come from organizations that support
> newer versions of some spec than the receiver does, some will be
> generated by software that interprets the ambiguities in the spec differently
> from the receiver, some of which will come from software that "embraces and
> extends" the spec .... ad nauseam? A "draconian" error handling policy
> just won't be any more viable than it would have been in Netscape 1.0.
I disagree. The cost of getting things wrong is too high. The cost of
coding the heuristics is too high. What percentage of the RDF out there
is not well-formed? What percentage of XML-RPC messages do not conform to
the standard (modulo bugs in the standard like "ASCII")?
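
Those percentages are at least cheap to measure. A rough sketch in Python,
assuming the messages are already in hand as strings (both function names
are hypothetical): a draconian check is just a parse that either succeeds
or raises, and the rejection rate falls out of counting the failures.

```python
import xml.etree.ElementTree as ET


def is_well_formed(message: str) -> bool:
    """Draconian check: accept only if the parser accepts the whole message."""
    try:
        ET.fromstring(message)
    except ET.ParseError:
        return False
    return True


def fraction_rejected(messages: list[str]) -> float:
    """Fraction of messages a draconian receiver would refuse."""
    if not messages:
        return 0.0
    rejected = sum(1 for m in messages if not is_well_formed(m))
    return rejected / len(messages)
```

This covers well-formedness only. Conformance to a particular vocabulary
such as XML-RPC would need additional checks that are not shown here.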
--
"When I walk on the floor for the final execution, I'll wear a denim
suit. I'll walk in there like Willie Nelson, John Wayne, Will Smith
-- Men in Black -- James Brown. Maybe do a Michael Jackson moonwalk."
Congressman James Traficant.