Bob Foster wrote:
> The naive approach to determining equivalence between two documents
> would be to define a recursive procedure that normalizes and collects
> element contents into sequences or sets, as appropriate, then compares
> the result for equality.
I think what you really want to do here is convert the XML to
"canonical form." This is the procedure used to do such comparisons
(as well as to enable signatures) in ASN.1 canonical and distinguished
encodings as well as in the Canonical XML which is defined as part of
XML Signature. For info on Canonical XML, see:
http://www.w3.org/TR/xml-c14n or http://www.ietf.org/rfc/rfc3076.txt
The basic rules for producing canonical XML are summarized as:
* The document is encoded in UTF-8
* Line breaks normalized to #xA on input, before parsing
* Attribute values are normalized, as if by a validating processor
* Character and parsed entity references are replaced
* CDATA sections are replaced with their character content
* The XML declaration and document type declaration (DTD) are removed
* Empty elements are converted to start-end tag pairs
* Whitespace outside of the document element and within start and
  end tags is normalized
* All whitespace in character content is retained (excluding
  characters removed during line feed normalization)
* Attribute value delimiters are set to quotation marks (double quotes)
* Special characters in attribute values and character content are
  replaced by character references
* Superfluous namespace declarations are removed from each element
* Default attributes are added to each element
* Lexicographic order is imposed on the namespace declarations and
  attributes of each element
See also: "Exclusive Canonical XML" in:
http://www.w3.org/TR/xml-exc-c14n/
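To make the idea of comparing by canonical form concrete, here is a minimal sketch in Python. It assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize(). Note that this function implements the later C14N 2.0 specification rather than the Canonical XML 1.0 referenced above, so it should not be read as covering every rule in the list. The helper name same_canonical_form and the sample documents are hypothetical, made up for illustration.

```python
from xml.etree.ElementTree import canonicalize  # available in Python 3.8+


def same_canonical_form(xml_a: str, xml_b: str) -> bool:
    """Compare two XML documents by their canonical serializations.

    Hypothetical helper: both inputs are canonicalized (C14N 2.0) and the
    resulting strings are compared for exact equality.
    """
    return canonicalize(xml_a) == canonicalize(xml_b)


# These two differ only in attribute order, quote style and
# empty-element syntax, all of which canonicalization removes.
doc_a = "<doc b='2' a=\"1\"><item/></doc>"
doc_b = '<doc a="1" b="2"><item></item></doc>'

# These two differ in element order, which canonicalization preserves.
doc_c = "<doc><x/><y/></doc>"
doc_d = "<doc><y/><x/></doc>"

if __name__ == "__main__":
    print(same_canonical_form(doc_a, doc_b))
    print(same_canonical_form(doc_c, doc_d))
```

The second pair shows the limitation for comparison purposes: canonical form keeps sibling elements in document order, so documents that differ only in element order still compare as different.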
All these conversions probably look like a great deal of work;
however, if you study the list carefully, you'll soon realize that
they are all required. Of course, the list may still be considered
incomplete if you are doing comparisons, since it doesn't address the
issue of ordering adjacent elements according to the lexicographic
order of their values...
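As a rough illustration of that missing step, the following Python sketch follows the recursive "normalize, collect, compare" idea from the quoted message and sorts the children of elements that are declared unordered. Everything here is an assumption made for the example: the function names normalize and equivalent, the "unordered" parameter (a set of tag names whose children are treated as a set), and the simplification that surrounding whitespace and tail text are ignored. It is not taken from any canonicalization specification.

```python
import xml.etree.ElementTree as ET


def normalize(elem, unordered=frozenset()):
    """Return a nested tuple that can be compared for equality.

    Children of elements whose tag is in `unordered` are sorted, so their
    document order no longer matters. Tail text is ignored (assumption).
    """
    children = [normalize(child, unordered) for child in elem]
    if elem.tag in unordered:
        # Impose a lexicographic order on adjacent elements.
        children.sort()
    text = (elem.text or "").strip()
    attrs = tuple(sorted(elem.attrib.items()))
    return (elem.tag, attrs, text, tuple(children))


def equivalent(xml_a, xml_b, unordered=frozenset()):
    """Compare two XML strings after normalization."""
    root_a = ET.fromstring(xml_a)
    root_b = ET.fromstring(xml_b)
    return normalize(root_a, unordered) == normalize(root_b, unordered)


if __name__ == "__main__":
    a = "<s><a>1</a><b>2</b></s>"
    b = "<s><b>2</b><a>1</a></s>"
    print(equivalent(a, b, unordered={"s"}))  # children of <s> form a set
    print(equivalent(a, b))                   # children of <s> form a sequence
```

Which elements count as unordered has to come from somewhere outside the document, such as a schema, which is why it is passed in as a parameter here.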
Does anyone know if a MIME type is being registered for
Canonical XML? Do we have "CXML" so that I can write stuff like
"application/foobar+cxml"?
> Suppose that a regular grammar-based schema language
> extended this idea to elements, essentially adding a
> "set" operator to the regular model.
> (This would be in addition to any existing unordered
> sequence operator, such as SGML's &, RELAX NG's & or
> XML Schema's 'all'.)
As long as you are reviewing the capabilities of various XML
schema languages, don't forget to consider the XML schema language
called ASN.1. ...
ASN.1 already has syntax for both ordered and unordered
elements. Unfortunately, it is the exact opposite of what you are
suggesting. In ASN.1 "SEQUENCE"s are ordered and "SET"s are unordered.
Thus, you can have:
    -- components must appear in the order listed
    Ordered ::= SEQUENCE {
        mustBeFirst   INTEGER,
        mustBeSecond  INTEGER }
or,
    -- components may appear in any order
    Unordered ::= SET {
        firstOrSecond  INTEGER,
        secondOrFirst  INTEGER }
It would have been wise if the WXS designers had noted existing
standards and used "SET" rather than "SEQUENCE" when defining WXS...
If nothing else, the ASN.1 definitions seem more logical simply based
on normal usage of the words "SEQUENCE" and "SET". To me, SEQUENCE
seems to imply order while SET simply implies membership in a set or
collection. But, whatever the words may be, you should be aware that
the imposition of ordering requirements in XML may be considered
objectionable and "not in the spirit of XML" by many people.
bob wyman