xml-dev - Re: Reality Check (was: Why the Infoset?)

From: Rick JELLIFFE <ricko@geotempo.com>
To: xml-dev@xml.org
Date: Fri, 04 Aug 2000 21:33:48 +0800

Sean McGrath wrote:
> 
> All,
> 
> Of the 6 parsers listed as of today on
> http://xmlconf.sourceforge.net none
> of them fully conform to XML 1.0.
> 
> Our debates on this list so often
> pre-suppose XML compliant
> tools. What does it matter what we decide to
> put in/leave out of an infoset when there are
> no tools capable of generating it anyway:-(

I think this may be a little misleading for some readers.  It may give
the idea that the bugs make it impossible to make reliable XML systems.

Looking quickly through the test results, it seems to me that the
data says people are very, very well-served if they send standalone
WF XML that
 * is well-formed with no encoding errors
 * is conservative with whitespace and name characters
 * avoids complex uses of parameter entities
Developers writing systems that receive XML should pay
attention to whitespace (newline substitution, stripping of
leading or trailing whitespace in attribute values, and
whitespace in mixed content next to elements).
At this point, 95% of XML developers can probably say "Oh, I'm OK"
and leave!
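
For receivers who want to see what the whitespace rules amount to, here is a
minimal Python sketch of the XML 1.0 end-of-line handling and attribute-value
normalization rules as I read them. The function names are hypothetical, and
the sketch deliberately ignores character and entity references, which a real
parser has to treat specially.

```python
import re

def normalize_newlines(text):
    """End-of-line handling: CR LF pairs and lone CRs become a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def normalize_attribute(value, is_cdata=True):
    """Simplified attribute-value normalization.

    Assumes `value` has already been through normalize_newlines() and
    contains no character or entity references (an assumption of this
    sketch, not something a real parser can rely on).
    """
    # Every tab, LF or CR becomes one space.
    value = re.sub(r"[\t\n\r]", " ", value)
    if not is_cdata:
        # Non-CDATA types (ID, NMTOKENS, ...) also lose leading and
        # trailing spaces, and runs of spaces collapse to one.
        value = re.sub(r" +", " ", value).strip(" ")
    return value
```

A sender who never puts tabs, line breaks or padding in attribute values gets
the same result from every parser regardless of how well it implements these
rules, which is the point of being conservative.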

Other perspectives on interpreting the data are welcome.

I would much prefer the test report to be categorized into
 1) error on good document
 2) no error on bad document
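
As a rough sketch of that two-way split, assuming each test outcome is
recorded as a pair (document should be accepted, parser accepted it); the
record shape and names are hypothetical and are not the format used by the
test report:

```python
from collections import Counter

def classify(expected_ok, parser_accepted):
    """Sort one test outcome into the two failure classes, or 'pass'."""
    if expected_ok and not parser_accepted:
        return "error on good document"
    if not expected_ok and parser_accepted:
        return "no error on bad document"
    return "pass"

def tally(outcomes):
    """Count outcomes per class for an iterable of (expected_ok, parser_accepted)."""
    return Counter(classify(expected, accepted) for expected, accepted in outcomes)
```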

Let's roughly categorise the WF test results into 4 groups:
   * instance misparsing
   * prolog/subsets/entity errors
   * whitespace errors
   * incorrect or missing diagnostic errors

If we look at things in those terms, we find for the WF tests
(I apologise for any mistakes; this is a quick count)

Parser    succeed  instance  prolog  whitespace  diagnostics***
---------------------------------------------------------------
Sun          1066         0       0           0  6 (xml:lang)
Aelfred      1062         0       7           0  3
XP           1057         0       9           5  0
Xerces/C     1043         3      26           0  0
Xerces/J     1020         0       0          46  6 (xml:lang)
MS2000*       963        60      30          21  5
IE5* **       943  similar but more ZZZZ

* The high numbers here do not indicate a hugely greater number
of bugs than the other WF parsers. The same class of errors
is being caught repeatedly. The whitespace errors mainly
concern normalization of attribute values. Most of the
instance errors relate to the handling of bad characters or
of non-ASCII name characters.

** IE5 seemed to have pretty much the same bugs as MS2000.
Like MS2000, many of the bugs relate to the parser being
too generous in what it accepts. It seems their dialect
is a little friendlier for HTML-ish mistakes (but they should
attend to it, or provide a dual mode parser "xml" and "html" or
best "xml" and "sgml").  However, both IE5 and MS2000 do fail
sometimes when they should not: I didn't look at the tests
to find out why, but it is probably important.

*** Some of the tests represent trivial problems: for
example, that xml:lang="123" is not treated as an error--
this is a sanity check of the creator of the document rather
than the parser! Of course, the test suite is correct to
test it: but when reading the numbers one should realize
that not all errors need to be weighted equally. 
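
For anyone who wants to redo the arithmetic, here is a small Python sketch
that tallies the rows of the WF table above. The figures are copied from the
table (so they inherit any mistakes in my quick count), IE5 is left out
because its breakdown is not given, and the names are only illustrative.

```python
# (succeed, instance, prolog, whitespace, diagnostics), copied from the table
RESULTS = {
    "Sun":      (1066, 0, 0, 0, 6),
    "Aelfred":  (1062, 0, 7, 0, 3),
    "XP":       (1057, 0, 9, 5, 0),
    "Xerces/C": (1043, 3, 26, 0, 0),
    "Xerces/J": (1020, 0, 0, 46, 6),
    "MS2000":   (963, 60, 30, 21, 5),
}

def summarize(results):
    """Return {parser: (failed_count, pass_rate_percent)} with equal weights."""
    summary = {}
    for parser, (succeed, *failures) in results.items():
        failed = sum(failures)
        summary[parser] = (failed, 100.0 * succeed / (succeed + failed))
    return summary

for parser, (failed, rate) in summarize(RESULTS).items():
    print(f"{parser:<9} failed={failed:<4} pass rate={rate:.1f}%")
```

This weights every failure equally, which is exactly what the note above
warns against; treat the percentages as a first look, not a ranking.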

I note that the tests of the Validating parsers seem to
have mostly the same errors. One cannot say that supporting
validation introduces a significant set of bugs.  Sun's parser
is clearly the parser of choice for conformance.

Rick Jelliffe

References:
- Re: Why the Infoset?
  - From: John Aldridge <john.aldridge@informatix.co.uk>
- Reality Check (was: Why the Infoset?)
  - From: Sean McGrath <sean@digitome.com>
