giving xml a colonic: re-opening (or re-opining on) an old argument

From: Amelia A Lewis <amyzing@talsever.com>
To: xml-dev@lists.xml.org
Date: Fri, 9 Sep 2011 23:17:08 -0400
Heyo.  Did a release this week (care to look at www.genxdm.org?), so 
now I'm stirring up trouble.  :-)

So, the namespaces in XML discussion.  Yes, yes, I'm going to flog that 
horse again.  It's not dead *enough*, you see.  Or perhaps I'm setting 
myself up as a Miracle Max ....

The Namespaces in XML spec is an ongoing problem for XML adoption (and 
even now, it's worth talking about adoption; part of the reason that 
further adoption has stalled is the set of problems with XML 
Namespaces).

Problem: complexity.  Here I want to talk about code complexity.  To 
handle namespaces in XML, you have to keep a "namespace context" 
hanging around.  That's fine for the processor, but it's a real problem 
for applications, which don't have a good way to access the processor's 
internals (and shouldn't, either).  Changes to what can be done with 
namespace-prefix mapping make this even more crufty; depending upon 
which version of Namespaces in XML you're supporting, you can undefine 
just the default prefix mapping (1.0), or any prefix mapping (1.1).  
You've got to have a 
new-on-change context for every element, but not change the context if 
the element doesn't introduce a change.  Even simple documents are 
weighted down with namespace concrete shoes.
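
To make the bookkeeping concrete, here is a minimal sketch in Python of the sort of "namespace context" an application ends up carrying around.  The class and method names (NamespaceContext, start_element, end_element, lookup) are invented for illustration and don't correspond to any particular processor's API; the sketch assumes the caller hands over each element's xmlns declarations as a dict.

```python
class NamespaceContext:
    """Stack of prefix -> namespace maps; a new map is pushed only on change."""

    def __init__(self):
        # The 'xml' prefix is always bound, per Namespaces in XML.
        self._stack = [{"xml": "http://www.w3.org/XML/1998/namespace"}]
        self._depths = [0]   # element depth at which each map was pushed
        self._depth = 0

    def start_element(self, declarations):
        """declarations: dict of prefix -> namespace ('' is the default prefix)."""
        self._depth += 1
        if not declarations:
            return                      # no change: reuse the parent's map
        scope = dict(self._stack[-1])   # copy, so the parent's map is untouched
        for prefix, namespace in declarations.items():
            if namespace == "":
                # Undeclaring: legal for the default prefix in 1.0,
                # and for any prefix in Namespaces 1.1.
                scope.pop(prefix, None)
            else:
                scope[prefix] = namespace
        self._stack.append(scope)
        self._depths.append(self._depth)

    def end_element(self):
        # Pop only if this element was the one that pushed a new map.
        if self._depths[-1] == self._depth:
            self._stack.pop()
            self._depths.pop()
        self._depth -= 1

    def lookup(self, prefix):
        """Return the namespace bound to prefix, or None if unbound."""
        return self._stack[-1].get(prefix)


ctx = NamespaceContext()
ctx.start_element({"": "urn:example:a"})   # <doc xmlns="urn:example:a">
ctx.start_element({})                      #   <child>
inherited = ctx.lookup("")                 #   still urn:example:a
ctx.start_element({"": ""})                #     <grandchild xmlns="">
undeclared = ctx.lookup("")                #     None
ctx.end_element()
ctx.end_element()
ctx.end_element()
```

Even this toy version needs a depth counter alongside the stack just to honor "new on change, unchanged otherwise."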

Problem: failure of intuition.  Have you ever tried to explain to 
someone why the element <element name="name" ... /> is the same as the 
element <xs:element name="name" ... />?  And why the attributes are the 
same?  And why the attributes in <type name="name" /> <xs:type 
name="name" /> are the same as each other, but not the same as the ones 
in element?  The most graceful explanation is to tell folks that 
elements define a scope for the naming of their attributes, which 
therefore don't require prefixes, but it always requires explanation.  
Ever noted that qnames in content refer to elements, never (hardly 
ever!) to attributes?
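
The element/attribute asymmetry can be checked with Python's standard xml.etree.ElementTree, which reports names in {namespace}name form.  This is only an illustration using the XML Schema namespace; the variable names are mine.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Same element, spelled with the default prefix and with an explicit one.
default_form = ET.fromstring(f'<element xmlns="{XS}" name="name"/>')
prefixed_form = ET.fromstring(f'<xs:element xmlns:xs="{XS}" name="name"/>')

# Both tags carry the namespace, so the elements compare equal by name.
same_element = default_form.tag == prefixed_form.tag

# The unprefixed attribute is in no namespace in either spelling:
# the default prefix mapping applies to elements, not to attributes.
same_attributes = default_form.attrib == prefixed_form.attrib

print(default_form.tag)     # {http://www.w3.org/2001/XMLSchema}element
print(default_form.attrib)  # {'name': 'name'}
```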

Problem: resistance to change.  Despite the problems with namespaces, 
the code is widespread.  It won't be thrown away.  And that's the 
primary reason to *not* write this email (or for you to stop reading; 
you really should, you know): the current status quo, however quirky, 
is widely-distributed (and if namespaces in XML aren't properly 
supported in your favorite language/library, then you probably don't 
care about the issues that the spec was designed to resolve).

But ... namespaces are important.  The introduction of namespaces 
provides potentially significant power: you can *combine vocabularies* 
(or call them schemas or DTDs or whatever).  Even browsers *could* make 
use of this, potentially, mapping namespaces to plugins (except that 
the XML implementation is so incredibly nasty that no browser actually 
seems to do this).

More importantly, perhaps, namespaces provide a means of distribution 
of authority for authoring schemas for different application areas or 
industries, which are useful even in mono-namespaced documents.  
Instead of W3C taking all the "good names" in the global XML namespace, 
everybody can define their own, and use the names that they think good 
inside it.

So ... what is a namespace?

Well, according to the Namespaces in XML specification, a namespace "is 
a URI."  That's utter horseshit, but it's what the spec says.  It 
doesn't actually have any of the characteristics of a URI, apart from 
syntax, but the spec really, really wants to pretend that a URI is what 
it is.

So, looking at the rest of the specification, what is it *really*?  
Well, it's actually got two potentially different definitions: from the 
point of view of a namespace creator/definer, and from the point of 
view of a namespace user.

From the point of view of a namespace creator, a namespace is a 
(reasonably) unique identifier with a low cost of entry, and 
distributed authority.  Now, the writers of the URI specification(s) 
envisioned, and the writers of particular URI scheme specifications 
detailed, a means of leveraging the distributed authority of the domain 
name system for creating unique identifiers.  DNS alone is inadequate, 
unless one assumes that a domain will create only a single namespace 
... but the power (and corresponding complexity) of URIs is probably 
overkill.  You need distributed authority, and a way of distinguishing 
multiple work products within the aegis of a single recipient of 
authority.  That provides a reasonably stable namespace name.

From the point of view of an end-user, a namespace isn't a URI at all.  
It's compared for string equality.  A namespace is a label, a string, 
an array of characters.  Equality is all that matters.  Any 
single-character variation means a different namespace.
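
A small Python illustration of the end-user view, using made-up example.com namespaces.  The helper namespace_of is hypothetical; it just peels the namespace off an ElementTree tag.

```python
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the namespace part of an ElementTree tag, or '' if there is none."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


a = ET.fromstring('<doc xmlns="http://example.com/ns"/>')
b = ET.fromstring('<doc xmlns="http://EXAMPLE.com/ns"/>')   # host case differs
c = ET.fromstring('<doc xmlns="http://example.com/ns/"/>')  # trailing slash

# As URIs, a and b would commonly be treated as equivalent (host names are
# case-insensitive).  As namespaces they are simply different strings.
a_equals_b = namespace_of(a) == namespace_of(b)   # False
a_equals_c = namespace_of(a) == namespace_of(c)   # False
```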

We'll come back to this, but let's move a little further.  Namespaces 
are of interest because they enable XML vocabularies that mix multiple 
namespaces.  Key examples include XML Schema, XPath, XSLT (and XPath2, 
XSLT2, XQuery).  The key abstraction in vocabulary merging seems to be 
the QName.

What, exactly, is a QName?

Well, it's defined to be a Qualified Name, but ... as with namespaces, 
that's horseshit in the well water.  Whatever it *really* is, 
"qualified" is prolly not the ideal descriptor.  Syntactically, a QName 
is a combination of NCName:NCName.  That's a no-colon-name, a colon, 
and a no-colon-name.  An obvious appellation would be "colon name", but 
that's apt to give rise to unpleasant jokes (not to mention subject 
lines), so should be avoided.

It can be described as an abbreviated name.  The combination of the 
expansion of the prefix (via mapping) with the local part of the name 
generates a 'complete' name.  Interestingly, though, the prefix (and 
even the colon) can be missing, and still <element> is not the same as 
<element>.

That's the problem with QNames: they have extremely *poor* locality.  
Worse, they extend that poor locality from themselves to every NCName 
in XML.  You can't know the name of an XML element without looking up 
its ancestor axis to generate the namespace context.  So, while 
Namespaces in XML allows you to embed foreign vocabularies, it makes it 
a real challenge to *extract* namespace-well-formed fragments from 
multi-namespace documents.
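
The extraction problem is easy to reproduce.  A sketch in Python with invented urn:example namespaces: cutting an element out of a document as text loses the declarations that lived on its ancestors.

```python
import xml.etree.ElementTree as ET

document = (
    '<doc xmlns="urn:example:outer" xmlns:h="urn:example:inner">'
    '<section><h:p>text</h:p><note/></section>'
    '</doc>'
)

# Cut <section>...</section> out as raw text, the way a naive tool might.
start = document.index("<section>")
end = document.index("</section>") + len("</section>")
fragment = document[start:end]

try:
    ET.fromstring(fragment)
    fragment_parses = True
except ET.ParseError:
    fragment_parses = False   # the prefix 'h' is unbound in the fragment

# Re-declaring only the prefix makes the fragment parse again, but <note/>
# has quietly changed its name: the default mapping was left behind too.
patched = fragment.replace("<section>", '<section xmlns:h="urn:example:inner">', 1)
note_in_fragment = ET.fromstring(patched).find("note")
note_in_document = ET.fromstring(document).find(".//{urn:example:outer}note")

print(note_in_fragment.tag)   # note
print(note_in_document.tag)   # {urn:example:outer}note
```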

Now, if you're reading xml-dev (and if you've continued this far, which 
was a mistake that you should certainly correct as soon as possible; I 
recommend beer, wine, or the distilled beverage of your choice), you've 
seen all this rehearsed before, I know.  And you've seen proposed 
solutions.  I'm not proposing much of a solution; I'm proposing 
something more on the order of a profile, for schemas and instances, 
that leaves the existing code infrastructure alone (for the time being).

How do we make namespaces in XML less egregious?

Well, first principle: avoid QNames.  They demonstrate poor locality, 
which makes processing portions of a document challenging.  So ... 
where are they used?

They're used in content, typically as references (they compete with IDs 
and keys as references, mind, but they're also used to describe 
'classes' or 'categories' of things).  This is anathema, frankly.  Any 
XML dialect that is using QNames in content (which unfortunately 
includes the most-adopted XML technologies: Schema, XPath, XSLT, 
XQuery) is broken.  It's broken because it fundamentally breaks 
layering: namespaces (and consequently QNames) are in the XML processor 
layer, while references are implicitly in the application layer.  
Exposing namespaces to the application layer universally means that 
every application, even those that don't care about namespaces, has to 
cope.  But namespace handling is unintuitive, and complex.  Ugh.
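
To show what "has to cope" means in practice, here is a sketch using Python's xml.etree.ElementTree.  The parser resolves element and attribute names, but a QName sitting in an attribute value is just a string to it, so the application has to track the prefix bindings itself.  The function name resolve_type_attributes is invented, and treating an unprefixed value as using the default mapping is Schema's convention, assumed here.

```python
import io
import xml.etree.ElementTree as ET

SCHEMA = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:element name="title" type="xs:string"/>'
    '</xs:schema>'
)


def resolve_type_attributes(xml_text):
    """Return the expanded names referenced by 'type' attributes."""
    bindings = []   # (prefix, namespace) pairs currently in scope, oldest first
    resolved = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end-ns", "start")):
        if event == "start-ns":
            bindings.append(payload)   # payload is a (prefix, namespace) tuple
        elif event == "end-ns":
            bindings.pop()
        elif event == "start" and "type" in payload.attrib:
            prefix, _, local = payload.attrib["type"].rpartition(":")
            scope = dict(bindings)     # later bindings override earlier ones
            namespace = scope.get(prefix)
            resolved.append(f"{{{namespace}}}{local}" if namespace else local)
    return resolved


print(resolve_type_attributes(SCHEMA))
# ['{http://www.w3.org/2001/XMLSchema}string']
```

All of that machinery exists only because the reference was written as a QName; the parser had the mapping and threw it away before the application saw the value.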

QNames are also used as attribute names.  This is, in fact, a necessary 
use case, that cannot be worked around.  There's an XML namespace, with 
attributes in it; every processor can handle those.  In order to put 
foreign attributes on an element, you have to prefix them (it's the 
only way to avoid the potential for name clashes).  Foreign attributes 
have to be distinguished from native ones; prefixes are the current 
solution.

QNames are used for elements.  This is simply unnecessary (note: 
there's a conflict between namespaces-in-content and 
namespaced-elements for Schema and XSLT, at least, but that's a 
namespaces-in-content problem, in my opinion).

Some proposed best practices for a namespaces-light set of XML schemas 
and instances: no QNames in content. No prefixed elements (change the 
mapping of the default prefix: xmlns=).  When foreign attributes are 
used, define the prefix mapping in the same element (xmlns:prefix= 
wherever prefix:name= appears).
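
Two of these three rules can be checked mechanically.  A rough sketch in Python, using expat with namespace processing switched off so that raw names (colons and xmlns attributes included) are visible; the function name profile_violations and the sample documents are hypothetical.  It makes no attempt to detect QNames in content, since only the application knows which values are references.

```python
import xml.parsers.expat


def profile_violations(xml_text):
    """Return human-readable violations of the namespaces-light profile."""
    violations = []

    def start_element(name, attrs):
        # Rule: no prefixed elements.
        if ":" in name:
            violations.append(f"prefixed element: {name}")
        # Rule: a foreign attribute's prefix is declared on the same element.
        for attr_name in attrs:
            prefix, colon, _ = attr_name.partition(":")
            if not colon or prefix in ("xmlns", "xml"):
                continue   # unprefixed, a declaration, or the built-in xml prefix
            if f"xmlns:{prefix}" not in attrs:
                violations.append(f"prefix '{prefix}' not declared on <{name}>")

    # No namespace_separator argument: expat reports names exactly as written.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return violations


good = (
    '<doc xmlns="com.example.doc">'
    '<item xmlns:x="com.example.ext" x:flag="yes"/>'
    '</doc>'
)
bad = (
    '<d:doc xmlns:d="com.example.doc" xmlns:x="com.example.ext">'
    '<item x:flag="yes"/>'
    '</d:doc>'
)

print(profile_violations(good))   # []
print(profile_violations(bad))
# ['prefixed element: d:doc', "prefix 'x' not declared on <item>"]
```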

But, if we've lost QNames in content, what do we do for references?

Next principle: if an application needs reference semantics, let it 
define them itself.  The application can then ensure that it never 
acquires a chunk of XML that lacks the necessary context for resolution.

A simple implementation of this is to use "expanded" (or "jc") names 
instead of QNames in content.  This is the combination of namespace and 
name into a single unit: {namespace}name (James Clark invented it, so 
far as I am aware).  For vocabularies in which references are uncommon, 
it's a simple, if somewhat cumbersome solution (XML Schema would groan 
under the weight, but it makes very extensive use of references by 
QName).
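
The expanded-name form is trivial to produce and to take apart, which is much of its appeal.  A minimal Python sketch; the function names are mine, and the form itself is the same one Python's ElementTree uses for tags.

```python
def to_expanded(namespace, local):
    """Combine a namespace and a local name into {namespace}name."""
    return f"{{{namespace}}}{local}" if namespace else local


def from_expanded(name):
    """Split {namespace}name into (namespace, local); namespace is '' if absent."""
    if name.startswith("{"):
        namespace, _, local = name[1:].partition("}")
        return namespace, local
    return "", name


expanded = to_expanded("http://www.w3.org/2001/XMLSchema", "string")
print(expanded)                  # {http://www.w3.org/2001/XMLSchema}string
print(from_expanded(expanded))   # ('http://www.w3.org/2001/XMLSchema', 'string')
print(from_expanded("title"))    # ('', 'title')
```

No context is needed in either direction: the name means the same thing wherever the chunk of XML ends up.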

If expanded names seem too cumbersome for the frequency of reference, 
then define an *application-level* abbreviation or mapping syntax.  
Nobody else needs to understand it, after all; it's not going to be 
built into your XML processor.  If it's an application responsibility, 
define it at the application level.
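
One possible shape for such a thing, sketched in Python.  Everything here is an assumption for illustration: the ALIASES table, the alias:local spelling, and the function name.  The point is only that the table belongs to the application and doesn't depend on any xmlns declarations in the instance.

```python
# Fixed, application-owned abbreviations; nothing in the document can rebind them.
ALIASES = {
    "xs": "http://www.w3.org/2001/XMLSchema",
    "html": "http://www.w3.org/1999/xhtml",
}


def expand_reference(reference, aliases=ALIASES):
    """Expand 'alias:local' to {namespace}local using the application's table."""
    alias, separator, local = reference.partition(":")
    if not separator:
        return reference   # no alias: the reference is a bare local name
    if alias not in aliases:
        raise KeyError(f"unknown alias: {alias}")
    return f"{{{aliases[alias]}}}{local}"


print(expand_reference("xs:string"))   # {http://www.w3.org/2001/XMLSchema}string
print(expand_reference("title"))       # title
```

Because the table is constant, a reference can be resolved from the value alone, with no walk up the ancestor axis.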

And a final principle: dump the bogus URIs, and use the simplest 
namespaces possible.

In namespaces, the 'scheme' portion of a URI provides no information.  
Drop it.  Alternatively: provide a public example of two XML namespaces 
differentiated by scheme.

That leaves domain + path + query | fragment (for common URIs; it's 
different for URNs or for the mailto: scheme and there are other corner 
cases you can generate, I'm sure, but we're going to focus here on 
domain-based namespace URIs with some flexibility) (or you can provide 
examples of other things; and do note that urn.[urn-pattern] is a 
perfectly reasonable 'domain' replacement that won't collide with 
DNS).  Query and fragment are not generally used, so let's drop them as 
well (or you can provide an example of two public XML namespaces 
differentiated by query or fragment).

So, domain + path.  That can be simplified to reverse-domain naming, 
with extension, if you care to.  Certainly easier, and it's commonly 
encountered in a number of programming languages.  It has the further 
advantage of removing the nasty non-Name characters, so you could do a 
form of expanded name as namespace:name rather than {namespace}name.  
The strongest objection to this would be an example of two public XML 
namespaces differentiated by identical fragments in the machine name 
and initial portion of the path, such that this change would make the 
two identical.
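
A mechanical version of that simplification, sketched in Python with the standard urllib.parse.  The function name is invented, and this is one plausible mapping rather than a proposal for the exact rule; it makes no attempt to clean up path characters that aren't legal in an XML Name.

```python
from urllib.parse import urlsplit


def reverse_domain_name(uri):
    """Drop scheme, query and fragment; reverse the host labels; append the path."""
    parts = urlsplit(uri)
    host_labels = parts.hostname.split(".") if parts.hostname else []
    path_segments = [segment for segment in parts.path.split("/") if segment]
    return ".".join(list(reversed(host_labels)) + path_segments)


print(reverse_domain_name("http://www.w3.org/2001/XMLSchema"))
# org.w3.www.2001.XMLSchema
print(reverse_domain_name("http://www.w3.org/1999/xhtml"))
# org.w3.www.1999.xhtml

# The collision described above, with made-up hosts: a host label in one URI
# matches a leading path segment in the other.
first = reverse_domain_name("http://a.example/b/c")
second = reverse_domain_name("http://b.a.example/c")
print(first == second)   # True: both become example.a.b.c
```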

Mind you, the above is suggested best practice.  No current XML 
processor will choke on a URI without a scheme, or with 
extended-reverse-domain naming, and the unique strings compared by 
users become easier to read and comprehend, with no loss of 
distinguishing information.  But since we're not proposing a change to 
processors at the moment, the most baroquely ugly and evil URIs remain 
permissible.

An alternative: define an "xmlns" 'scheme' for URIs (that identify 
namespaces, not pretending to identify any other sort of resource), 
with a simplified pattern as above.  This is slightly more problematic 
for the longer term, because a QName has one colon, not two.  But such 
a scheme definition could otherwise restrict itself to characters 
permitted by the XML Name production.

Summarizing: we could, at present, start using simplified domain-based 
non-URIs for namespaces, avoid QNames in content (and replace with 
application-level mapping as needed), use default-prefix mapping for 
elements, and only use non-default prefix mapping 
(xmlns:prefix="namespace") for attributes.

This doesn't get us a long way forward, but avoids some problems, and 
opens the possibility, if enough people decided to do this, that some 
future cleanups to Namespaces in XML (along these lines, that is) could 
be implemented.

There are some obvious obstacles.  W3C XML Schema is one.  It makes 
very extensive use of QNames in content, for reference; it commonly 
binds the default prefix to the target namespace so that those 
references need not be prefixed, in content, which means that the 
structure of the schema (the elements) must be prefixed.  This is hard 
to fix, although it would be straightforward enough for a future 
version of schema to drop the use of QNames and adopt a 
schema-parser/validator-level mapping instead.  But it must be 
acknowledged that Schema's use of QNames in content is one of the 
primary obstacles to making any change to the status quo.

XPath (and XSLT and various other things, like XPath2 and XSLT2 and 
XQuery) are probably easier.  XPath currently delegates mapping to its 
host language, which means that a host language revision or variant 
could use application-level mapping instead of breaking layers by using 
XML processor-level mapping.  However, a variant of XPath is perfectly 
feasible, in which the expressions are enhanced with the inclusion of 
namespaces, using the JC expanded-name form.  Instead of 
//xs:schema/xs:annotation//html:p, one would write 
//{org.w3c.xml.ns.schema.2.1}schema/annotation//{org.w3c.html.5.1}p 
(with an apology for the version numbers).  XPath could define the 
context of a namespace declaration (represented by the {namespace} 
particle prior to a name) as either 'descendant' or 
'following-in-expression' (some investigation would show which would be 
preferred; descendant seems likely, unless there's more use of 
non-descendant axes than I've encountered).  Note that this would not 
include foreign attributes, which would appear as @{namespace}name.
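
Something close to this already exists in the limited XPath subset of Python's xml.etree.ElementTree, which accepts {namespace}name steps with no prefix mapping at all.  The sketch below uses the real W3C namespace strings rather than the simplified ones above.  Note that ElementTree has no notion of a 'descendant' scope for the namespace, so every step has to repeat it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
XHTML = "http://www.w3.org/1999/xhtml"

schema = ET.fromstring(
    f'<xs:schema xmlns:xs="{XS}" xmlns:html="{XHTML}">'
    '<xs:annotation><xs:documentation>'
    '<html:p>Describes a title.</html:p>'
    '</xs:documentation></xs:annotation>'
    '</xs:schema>'
)

# Expanded names in the path: no prefixes, no namespace context to supply.
# The path is relative to the root <schema> element.
path = f"./{{{XS}}}annotation//{{{XHTML}}}p"
paragraphs = schema.findall(path)

print(path)
# ./{http://www.w3.org/2001/XMLSchema}annotation//{http://www.w3.org/1999/xhtml}p
print([p.text for p in paragraphs])   # ['Describes a title.']
```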

But what's the point?  Sure, I can argue that it's best practice to do 
some of this stuff, but we've already seen multiple namespace-fixup 
proposals die in flames.

Eh.  :-)  I think that these are best practices, and can be adopted by 
folks now, without checking with other people.  If they were adopted, 
then some of the possible solutions outlined in the final 'problems' 
section might see some traction (but nobody's gonna bother doing 
expanded names in *major* application dialects unless they've seen 
other folks adopting expanded names elsewhere).  They probably make 
your XML cleaner, more understandable, and work with current processors.

And given enough folks adopting something like this set of practices 
(especially with regard to QNames: remove them from content, use 
default prefix only for elements, always declare the prefix in the same 
element that contains foreign attributes), processors could start to 
consider optimization (or, more bluntly: put the crufty namespace code 
off in a de-optimized branch that's only invoked when the simple (and 
faster) best-practice form isn't working).

Given enough adoption of namespace-simplification (to 
extended-reverse-domain, or something equivalent), then a new set of 
revisions of core specs might acknowledge that, as well, and might even 
permit use of un-mapped names (actual "qualified" names in this case: 
org.w3c.xml.ns.schema.3.0.element, for example).  And perhaps even move 
to a central registry for widely-used vocabularies 
(org.w3c.xml.ns.schema.42.0 == xs ?).

What, are you *still* reading?  It's the weekend!  Go do something fun!

Amy!
-- 
Amelia A. Lewis                    amyzing {at} talsever.com
    Merchant, street girl, beggar, yeoman,
    king or common, man or woman,
    only two things make us human--
    sorrow and love, sorrow and love ....
                -- The Last Song of Sirit Byar