Re: XML 1.0 Conformance Test Results
- From: Richard Tobin <richard@cogsci.ed.ac.uk>
- To: xml-dev@lists.xml.org
- Date: Tue, 12 Jun 2001 13:10:45 +0100 (BST)
> In looking at the sun/valid/not-sa02.xml file, I can't find any tokens
> that are separated _only_ by character references to whitespace.
You're right, my description was just a shorthand for a more complicated
set of problems. Here is the long, historical version.
There are two aspects to it: attribute value normalization, and validation
of normalized attributes.
NORMALIZATION:
In the first edition of XML 1.0, the description of attribute
normalization was unclear. Were the normalization actions listed in
section 3.3.3 meant to be alternatives, or applied in sequence? They
were meant to be alternatives, but this was not everyone's
interpretation.
Consider the example:
nmtokens = " this&#13;&#10; also gets&#32;normalized "
If the actions were applied sequentially, the &#13; would first be
replaced by a carriage-return character, and then by a space, and
similarly for the &#10;. The &#32; would of course get replaced by
a space. The result would be
" this also gets normalized "
Assuming that the attribute was of a tokenized type, say NMTOKENS,
it would then get normalized to
"this also gets normalized"
and would be straightforwardly valid.
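The sequential reading can be sketched in Python (the function name is
mine, and I'm assuming the example value is written with the character
references &#13;, &#10; and &#32;):

```python
import re

def normalize_sequential(raw: str) -> str:
    """The (incorrect) sequential reading: first expand character
    references, THEN fold all whitespace to spaces -- so a reference
    to CR or LF ends up as a space too."""
    def repl(m: re.Match) -> str:
        ref = m.group(1)
        code = int(ref[1:], 16) if ref.startswith("x") else int(ref)
        return chr(code)
    # Expand character references to the referenced characters ...
    out = re.sub(r"&#(x[0-9a-fA-F]+|[0-9]+);", repl, raw)
    # ... then fold every TAB, CR and LF to a space.
    return re.sub(r"[\t\r\n]", " ", out)

normalize_sequential(' this&#13;&#10; also gets&#32;normalized ')
# -> ' this   also gets normalized '  (the CR and LF became spaces)
```

The run of three spaces after "this" then collapses to one in the
second, tokenized-type stage, giving the valid-looking result above.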
But that's wrong. The actions are meant to be alternatives. Character
references are replaced by the corresponding characters, but if those
characters happen to be whitespace this doesn't result in them being
converted to spaces. So the result after the first stage of normalization
should be
" this<CR><LF> also gets normalized "
where <CR> and <LF> represent the carriage-return and linefeed characters.
The second stage of normalization would then produce
"this<CR><LF> also gets normalized" (*)
because it compresses strings of space characters, not strings of whitespace.
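The correct, alternatives reading of both stages can be sketched the
same way (again, the function names are mine; the two-pass trick works
only because a character reference itself contains no whitespace):

```python
import re

def normalize_cdata(raw: str) -> str:
    """First stage, alternatives reading: literal whitespace becomes
    #x20, but a character reference is replaced by the referenced
    character itself -- a reference to CR, LF or TAB is NOT turned
    into a space."""
    def repl(m: re.Match) -> str:
        ref = m.group(1)
        code = int(ref[1:], 16) if ref.startswith("x") else int(ref)
        return chr(code)
    # Fold literal TAB/CR/LF to spaces first (references contain none) ...
    out = re.sub(r"[\t\r\n]", " ", raw)
    # ... then expand references to the referenced characters.
    return re.sub(r"&#(x[0-9a-fA-F]+|[0-9]+);", repl, out)

def normalize_tokenized(value: str) -> str:
    """Second stage, for tokenized types: strip leading and trailing
    spaces and compress runs of #x20 -- not runs of whitespace."""
    return re.sub(r" +", " ", value).strip(" ")

stage1 = normalize_cdata(' this&#13;&#10; also gets&#32;normalized ')
# stage1 == ' this\r\n also gets normalized '
stage2 = normalize_tokenized(stage1)
# stage2 == 'this\r\n also gets normalized'   -- the value marked (*)
```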
Erratum 70 (http://www.w3.org/XML/xml-19980210-errata#E70) attempted to
make this clearer, explicitly stating that character references to
CR, LF and TAB do not get normalized to spaces.
VALIDATION:
Normalization is intended to turn tokenized attributes into lists of
tokens separated by single spaces, for easy processing by the
application. To be valid, after normalization, NMTOKENS attributes
must match the Nmtokens production, and ENTITIES and IDREFS attributes
must match the Names production. Unfortunately these productions were
given as
[6] Names ::= Name (S Name)*
[8] Nmtokens ::= Nmtoken (S Nmtoken)*
("S" means whitespace).
The effect of this is to make the normalized value marked (*) be
valid, even though normalization has not made it into a list of
space-separated tokens! The intention was to follow SGML, and make
such values be invalid. The mistake was corrected in erratum 62
(http://www.w3.org/XML/xml-19980210-errata#E62) which changed the
productions to
[6] Names ::= Name (#x20 Name)*
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*
where S has been replaced by #x20.
At this point, all was well. XML was compatible with SGML, and
normalized valid tokenized values were always strings of tokens
separated by single space characters.
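The difference between the two forms of the production can be seen
with a regular-expression sketch (the Nmtoken class below is a
simplified ASCII subset, enough for this example, not the full XML
grammar):

```python
import re

# Simplified Nmtoken: one or more name characters (ASCII subset only).
NMTOKEN = r"[A-Za-z0-9._:-]+"
# [3] S, the whitespace production: space, TAB, CR, LF.
S = r"[ \t\r\n]+"

# [8] Nmtokens ::= Nmtoken (S Nmtoken)*      -- the faulty form
NMTOKENS_S = re.compile(rf"{NMTOKEN}({S}{NMTOKEN})*")
# [8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*   -- E62 / 2nd-edition E20
NMTOKENS_X20 = re.compile(rf"{NMTOKEN}( {NMTOKEN})*")

value = "this\r\n also gets normalized"   # the value marked (*)
NMTOKENS_S.fullmatch(value)    # matches: S accepts the CR and LF
NMTOKENS_X20.fullmatch(value)  # no match: only single #x20 separators
```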
Unfortunately, someone queried erratum 62, and in a fit of collective
amnesia the XML Core WG forgot that the validity constraints applied
*after* attribute value normalization. It seemed that perfectly
reasonable cases like
nmtokens="foo
bar"
had been ruled out (which of course they hadn't). Erratum 108
(http://www.w3.org/XML/xml-19980210-errata#E108) restored the faulty
productions, and worse still this was done immediately before
publication of the second edition.
The mistake was later realized, and erratum 20 to the second edition
(http://www.w3.org/XML/xml-V10-2e-errata#E20) restored the old E62.
In accordance with the law of cartoon amnesia, all is well if you get
hit on the head an even number of times.
The OASIS test suite is particularly confused, and the output files
for not-sa02 and sa02 do not match any of the errata.
-- Richard