   Re: [xml-dev] A plea for Sanity


Great post.


----- Original Message ----- 
From: "Joe English" <jenglish@flightlab.com>
To: <xml-dev@lists.xml.org>
Sent: Friday, April 05, 2002 11:43 AM
Subject: [xml-dev] A plea for Sanity

> [ Also sent to xml-names-editor@w3.org ]
> "Namespaces in XML 1.1 Requirements" cites the ability to "undeclare"
> a namespace as the principal (only?) new needed feature, because
> of the case where:
> | information items [...] from another document [...] may
> | have fewer in-scope namespaces than their parent.  There is
> | no mechanism for accurately serializing this situation. If
> | the infoset is naively serialized and reparsed, the children
> | will end up with additional namespace information items which
> | serve no useful purpose.
> I believe that this requirement is ill-considered.
> Under SGML and XML 1.0, applications can treat generic
> identifiers as atomic strings; with XML 1.0 + Namespaces,
> element and attribute names become compound objects consisting
> of a URI and a local name.  This complicates applications a bit,
> but by itself is not an onerous burden: toolkits like SAX can
> provide namespace processors that keep track of the namespace
> environment, map GIs to {URI+localname} pairs, and throw away
> the original namespace declarations.
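A minimal sketch of the namespace-processor idea quoted above, using Python's standard xml.sax with namespace processing switched on. The helper name expanded_names and the urn:example:x URI are invented for illustration; the point is only that two documents using different prefixes reduce to the same (URI, localname) pairs once the declarations are thrown away.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class ExpandedNames(ContentHandler):
    """Collects the (URI, localname) pair of every element start-tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, `name` is a (uri, localname) tuple.
        self.names.append(name)


def expanded_names(xml_text):
    # Hypothetical helper: parse a string and return its expanded element names.
    handler = ExpandedNames()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.names


prefixed = '<a:doc xmlns:a="urn:example:x"><a:item/></a:doc>'
defaulted = '<doc xmlns="urn:example:x"><item/></doc>'
print(expanded_names(prefixed) == expanded_names(defaulted))
```
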
> The real complexity starts to show up in applications which
> themselves need to keep track of the namespace environment
> (e.g., XSLT).  This is usually required for applications that
> need to reserialize an Infoset as XML and wish to retain
> the original namespace prefixes on output.  (It gets hairier
> for markup vocabularies that include QNames in content, but that's
> a different issue.)
> But the new requirement implies that the *exact set of in-scope
> namespaces at each node* is an essential part of the Infoset.
> This is the part that I think is ill-considered.  This property
> should be deemed inessential, just as whitespace in tags and the
> order of attribute value specifications are deemed inessential.
> XML-related specifications should not expect or demand that it be 
> preserved; any set of namespace declarations that produces the same 
> {URI+localname} pairs after namespace processing should be considered 
> equivalent.
> In particular, "additional namespace information items which
> serve no useful purpose" -- and hence do not affect the interpretation
> of QNames in markup or content -- should not matter.  Applications
> should be free to insert or discard them as they see fit without
> changing the meaning of the Infoset.
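One way to read the equivalence proposed above, sketched with ElementTree, which already reports names as "{URI}localname". The function names and URIs are my own, and the comparison deliberately ignores character data, so it says nothing about QNames in content; it is an illustration of the criterion, not a full Infoset comparison.

```python
import xml.etree.ElementTree as ET


def expanded_view(elem):
    """Nested tuple of expanded element name, sorted attributes, and children."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        tuple(expanded_view(child) for child in elem),
    )


def equivalent(xml_a, xml_b):
    # Two documents count as equivalent here if namespace processing
    # yields the same expanded names, whatever declarations they carry.
    return expanded_view(ET.fromstring(xml_a)) == expanded_view(ET.fromstring(xml_b))


bare = '<p:a xmlns:p="urn:example:x"><p:b k="1"/></p:a>'
noisy = (
    '<q:a xmlns:q="urn:example:x" xmlns:junk="urn:example:unused">'
    '<q:b xmlns:more="urn:example:other" k="1"/></q:a>'
)
print(equivalent(bare, noisy))
```

The extra declarations in the second document never reach the comparison, which is exactly the "should not matter" position.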
>  * * *
> Now a plea for sanity.
> (This is for people who design XML vocabularies and applications;
> xml-names-editor, I know you're busy, so you can stop reading here.)
> There are certain practices which, if avoided, can make life
> simpler for application and toolkit developers.  These are
> all legal according to the Namespaces REC, and I don't suggest
> that they be disallowed in XML 1.1, but it may be beneficial
> for individual applications to disallow them.
> Some definitions:
> Let's say that an XML document is _neurotic_ if it maps the same
> namespace prefix to two different namespace URIs at different
> points.  Neurosis makes it necessary for XML processors to
> work with {URI+localname} pairs instead of GIs, and to keep
> track of the namespace environment at each point in the tree
> if there are QNames-in-content.  If it weren't for neurosis,
> applications could use a single namespace map that applied to
> the entire document.
> Conversely, a document is _borderline_ if it maps two different
> namespace prefixes to the same namespace URI.  Borderline documents
> complicate reserialization: the choice of which prefix to
> use for a particular {URI+localname} pair depends on its
> position in the tree.
> A document is _psychotic_ if it maps two different namespace prefixes
> to the same URI _in the same scope_.  Psychosis presents an even
> bigger difficulty for reserialization: now applications must keep
> track of the original prefix as well as the {URI+localname} pair.
> A document is _normal_ (or _in namespace-normal form_) if all
> namespace declarations appear on the root element and it is
> not psychotic.  (A borderline document with all namespace 
> declarations in the same place is automatically psychotic;
> a neurotic document with this property would be illegal according
> to the Namespaces REC.)
> Normal documents are the easiest to process: the application can
> determine the global namespace environment at the beginning of the
> parse, and can use it throughout processing.
> It's not always possible to produce normal documents -- the producer
> might not know all of the relevant namespaces at the time it emits
> the root element start-tag -- so a weaker definition is useful:
> A document is _sane_ if it is neither neurotic nor borderline.
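The definitions above can be checked mechanically. This is a sketch under my own reading of them, not anything from the Namespaces REC: "in the same scope" is taken to mean "both bindings are in effect at some element", the default namespace is treated as the empty prefix, and xmlns="" gets no special handling. The function name classify is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def classify(xml_text):
    uris_by_prefix = {}   # prefix -> every URI it is bound to anywhere
    prefixes_by_uri = {}  # URI -> every prefix bound to it anywhere
    stack = []            # (prefix, uri) declarations currently in scope
    depth = 0
    all_on_root = True
    psychotic = False
    events = ("start", "end", "start-ns", "end-ns")
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events):
        if event == "start-ns":
            prefix, uri = payload
            uris_by_prefix.setdefault(prefix, set()).add(uri)
            prefixes_by_uri.setdefault(uri, set()).add(prefix)
            stack.append((prefix, uri))
            if depth > 0:  # declared on something below the root
                all_on_root = False
        elif event == "end-ns":
            stack.pop()
        elif event == "start":
            depth += 1
            in_scope = dict(stack)  # inner bindings shadow outer ones
            if len(set(in_scope.values())) < len(in_scope):
                psychotic = True  # two live prefixes share one URI
        else:  # "end"
            depth -= 1
    neurotic = any(len(uris) > 1 for uris in uris_by_prefix.values())
    borderline = any(len(pfx) > 1 for pfx in prefixes_by_uri.values())
    return {
        "neurotic": neurotic,
        "borderline": borderline,
        "psychotic": psychotic,
        "sane": not (neurotic or borderline),
        "normal": all_on_root and not psychotic,
    }


print(classify('<a xmlns:p="urn:x"><p:b/><c xmlns:p="urn:y"><p:d/></c></a>'))
```
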
> Document producers should be designed to emit sane documents.
> This is not hard to do -- the serializer just needs to maintain
> a monotonic, bijective URI/prefix map and reuse the same prefix
> whenever a namespace URI leaves and comes back into scope.
> ("Bijective": there is precisely one URI for each prefix and
> one prefix for each URI; by "monotonic" I mean that prefix+URI
> pairs may be added to the map but not removed.)
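A sketch of the map described above, assuming a serializer that asks it for a prefix each time it needs to write a name. The class name PrefixMap and the generated "nsN" prefixes are my own choices. Pairs are only ever added, and a preferred prefix that is already taken by another URI is replaced by a fresh one, which keeps the map bijective.

```python
class PrefixMap:
    """Monotonic, bijective prefix/URI map: pairs are added, never changed."""

    def __init__(self):
        self._prefix_for = {}  # URI -> prefix
        self._uri_for = {}     # prefix -> URI

    def prefix_for(self, uri, preferred=None):
        # Reuse the existing prefix whenever the URI comes back into scope.
        if uri in self._prefix_for:
            return self._prefix_for[uri]
        candidate = preferred
        n = len(self._prefix_for)
        # Never rebind a prefix that already names a different URI.
        while candidate is None or candidate in self._uri_for:
            candidate = f"ns{n}"
            n += 1
        self._prefix_for[uri] = candidate
        self._uri_for[candidate] = uri
        return candidate


m = PrefixMap()
print(m.prefix_for("urn:example:x", "p"))  # first use: preferred prefix accepted
print(m.prefix_for("urn:example:y", "p"))  # "p" is taken, so a fresh one is made
print(m.prefix_for("urn:example:x", "q"))  # URI already known: old prefix reused
```
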
> A sane document can be transformed into a normal document simply
> by moving all namespace declarations to the root element and
> filtering out duplicates.  (This can't be done in streaming
> mode, but it might be an appropriate technique for XML databases.)
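The hoisting step above might look like this with xml.dom.minidom, which keeps xmlns declarations as ordinary attributes. The function name is hypothetical. Two cautions that are my own assumptions rather than part of the quoted text: the sketch refuses a default-namespace declaration below the root unless the root already carries the same one (moving it up would change the meaning of unprefixed names outside its original scope), and it only rejects neurotic input, so a borderline document would come out psychotic rather than being refused.

```python
from xml.dom import minidom


def hoist_declarations(xml_text):
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    seen = {}  # "xmlns" or "xmlns:prefix" -> URI
    for elem in [root] + list(root.getElementsByTagName("*")):
        for name, uri in list(elem.attributes.items()):
            if name != "xmlns" and not name.startswith("xmlns:"):
                continue  # an ordinary attribute, leave it alone
            if name == "xmlns" and elem is not root and root.getAttribute("xmlns") != uri:
                raise ValueError("default namespace changes below the root")
            if seen.setdefault(name, uri) != uri:
                raise ValueError(f"{name} is bound to two URIs (neurotic)")
            if elem is not root:
                elem.removeAttribute(name)
    # Duplicates were already merged in `seen`; declare each pair once.
    for name, uri in seen.items():
        root.setAttribute(name, uri)
    return root.toxml()


print(hoist_declarations(
    '<a xmlns:p="urn:x"><p:b/><c xmlns:q="urn:y"><q:d q:k="1"/></c></a>'
))
```

As the quoted text notes, this needs the whole tree in memory, so it is not a streaming technique.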
> Now, general-purpose XML consumers cannot expect to receive sane
> documents.  However, *special-purpose* consumers, designed to work
> with specific markup vocabularies, can be a lot simpler if the
> markup vocabulary includes namespace sanity as a requirement.
> As an application developer, I'd prefer not to have to worry
> about namespace nodes or {URI+localname} pairs.  I'd rather be
> able to give the parser an internal namespace map describing
> all the namespace URIs I'm interested in, and have the parser
> translate QNames in markup to use my prefixes.  Then the application
> can work with GIs instead of {URI+localname} pairs.  If the source
> document is sane, then it's possible to preserve the original prefixes
> on reserialization simply by remembering the original namespace map;
> it's not necessary to keep track of namespace nodes during processing.
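A rough sketch of the wished-for parser interface above, built on xml.sax. The names Renamer and read_with_map, the URIs, and the choice to fall back to "{URI}localname" for namespaces the application did not list are all assumptions of mine. The handler rewrites element names to the application's prefixes and remembers the document's own prefix for each URI, which for a sane document is all that reserialization needs.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class Renamer(ContentHandler):
    def __init__(self, my_prefixes):
        super().__init__()
        self.my_prefixes = my_prefixes  # URI -> prefix the application wants
        self.original = {}              # URI -> prefix the document used
        self.gis = []                   # element names as the application sees them

    def startPrefixMapping(self, prefix, uri):
        # SAX reports the default namespace with prefix None.
        self.original.setdefault(uri, prefix or "")

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if uri is None:
            self.gis.append(local)                 # no namespace at all
        elif uri in self.my_prefixes:
            self.gis.append(f"{self.my_prefixes[uri]}:{local}")
        else:
            self.gis.append(f"{{{uri}}}{local}")   # namespace not in the map


def read_with_map(xml_text, my_prefixes):
    handler = Renamer(my_prefixes)
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler


h = read_with_map(
    '<x:doc xmlns:x="urn:example:doc"><x:p/><other/></x:doc>',
    {"urn:example:doc": "d"},
)
print(h.gis, h.original)
```
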
> QNames in content are a lot easier to process in a sane document.
> Sanity guarantees that a given QName means the same thing wherever
> it appears.  Any future markup vocabulary which uses QNames in content
> should include sanity as an application requirement.
> A requirement for sanity shifts part of the burden onto document
> producers, where it's easy to handle.  The alternative is maddening
> complexity for document consumers.
> --Joe English
>   jenglish@flightlab.com

