xml-dev - Re: [xml-dev] patterns vs. identifiers

To: Mike Champion <mc@xegesis.org>, xml-dev@lists.xml.org
Subject: Re: [xml-dev] patterns vs. identifiers
From: Paul Prescod <paul@prescod.net>
Date: Mon, 19 Aug 2002 22:20:47 -0700
References: <9GE934X08LKNKQO2XA6EC974Y84VR1X.3d61c826@MChamp>

Mike Champion wrote:
> 
> ...
> 
> I may well be over-optimistic; I'm trying to put together some code
> to explore the issue.  For what it's worth, my suspicion that there
> *is* a lot one could do with  fairly simple heuristics was strengthened
> by reading http://www.paulgraham.com/spam.html  (a discussion of
> spam filtering):

There is a big difference between depending on context and depending on
context *heuristically*. Every programming language uses context. Very
few (!) use heuristics.

> " A few simple rules will take a big bite out of your incoming spam.
> Merely looking for the word "click" will catch 79.7% of the emails in
> my spam corpus, with only 1.2% false positives."

That's fine because the price of a wrongly classified email slipping
through is so low. That is rarely the case in other computer science
applications.

> Also check out Eugene Kuznetzov's article in XML Journal on
> XML-aware network equipment http://www.sys-con.com/xml/articleprint.cfm?id=459
> In discussing the challenge of recognizing a specific XML
> vocabulary and routing messages in that vocabulary to a specialized
> processor, he says "the same device could send messages in a particular
>  XML vocabulary to the server capable of processing them, or it could
> send separate XML-RPC and SOAP messages. The routing rules are specified
> using either proprietary pattern-matching languages or a limited subset of XPath."

But there is nothing heuristic about using XPath! XPaths are precise
matching expressions.
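
To make the distinction concrete, here is a minimal sketch of routing by
exact path match rather than by guesswork. It assumes Python's standard
xml.etree.ElementTree module (which supports only a limited XPath subset)
and the SOAP 1.1 envelope namespace; the function name `route` and the
labels it returns are hypothetical, not taken from the article quoted
above.

```python
import xml.etree.ElementTree as ET

# Assumption: SOAP 1.1 envelope namespace; the routing labels are illustrative.
SOAP_NS = {"soap": "http://schemas.xmlsoap.org/soap/envelope/"}


def route(message: str) -> str:
    """Return a routing label based on exact path matches, not guesses."""
    # ElementTree paths are evaluated relative to an element's children,
    # so wrap the parsed root in a holder to let a path match the root itself.
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(message))

    # XML-RPC: a methodCall root that contains a methodName child.
    if wrapper.find("methodCall/methodName") is not None:
        return "xmlrpc"
    # SOAP: an Envelope root with a Body child, both in the SOAP namespace.
    if wrapper.find("soap:Envelope/soap:Body", SOAP_NS) is not None:
        return "soap"
    return "unrouted"
```

A message either matches a path or it does not. A document that merely
mentions "methodCall" in its text content is never routed as XML-RPC,
which is the opposite of looking for the word "click".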

>...
> Also, I really hate to mention this :-) but think of the "wonderful" job
> that browsers do in making sense out of hideously invalid HTML. 

Once again, the cost of getting things wrong is low.

> ... Is there
> any reason to think that that level of creative hackery can't or won't
> be applied to the challenge of making sense out of business messages
> in XML, some of which will come from buggy software, some of which will be
> human edited, some of which will come from organizations that support
> newer versions of some spec than the receiver does, some will be
> generated by software that interprets the ambiguities in the spec differently
> from the receiver, some of which will come from software that "embraces and
> extends" the spec .... ad nauseum?  A "draconian" error handling policy
> just won't be any more viable than it would have been in Netscape 1.0.

I disagree. The cost of getting things wrong is too high. The cost of
coding the heuristics is too high. What percentage of the RDF out there
is non-well-formed? What percentage of XML-RPC messages do not conform to
the standard (modulo bugs in the standard like "ASCII")?
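
For comparison, a "draconian" receiver costs almost nothing to write.
The sketch below assumes Python's standard xml.etree.ElementTree parser;
the function name `accept` is hypothetical. It rejects any message that
is not well-formed instead of trying to repair it.

```python
import xml.etree.ElementTree as ET


def accept(message: str) -> bool:
    """Accept a message only if it parses as well-formed XML."""
    try:
        ET.fromstring(message)
    except ET.ParseError:
        # No recovery and no guessing: the sender has to fix the message.
        return False
    return True
```

Here accept("<a><b/></a>") returns True, while accept("<a><b></a>")
returns False because of the mismatched tag. This checks only
well-formedness; conformance to a vocabulary such as XML-RPC would need
a further validation step.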

-- 
"When I walk on the floor for the final execution, I'll wear a denim 
suit. I'll walk in there like Willie Nelson, John Wayne, Will Smith 
-- Men in Black -- James Brown. Maybe do a Michael Jackson moonwalk."
Congressman James Traficant.
