giving xml a colonic: re-opening (or re-opining on) an old argument
- From: Amelia A Lewis <amyzing@talsever.com>
- To: xml-dev@lists.xml.org
- Date: Fri, 9 Sep 2011 23:17:08 -0400
Heyo. Did a release this week (care to look at www.genxdm.org?), so
now I'm stirring up trouble. :-)
So, the namespaces in XML discussion. Yes, yes, I'm going to flog that
horse again. It's not dead *enough*, you see. Or perhaps I'm setting
myself up as a Miracle Max ....
The Namespaces in XML spec is an ongoing problem for XML adoption (and
even now, it's worth talking about adoption; part of the reason that
further adoption has stalled is the set of problems with XML
Namespaces).
Problem: complexity. Here I want to talk about code complexity. To
handle namespaces in XML, you have to keep a "namespace context"
hanging around. That's fine for the processor, but it's a real problem
for applications, which don't have a good way to access the processor's
internals (and shouldn't, either). Changes to what can be done with
namespace-prefix mapping make this even more crufty; depending upon
which version of XML you're supporting, you can undefine the default
prefix mapping, or all prefix mappings. You've got to have a
new-on-change context for every element, but not change the context if
the element doesn't introduce a change. Even simple documents are
weighted down with namespace concrete shoes.
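
A rough sketch of that bookkeeping, in Python, for anyone who hasn't
had to write it. The class and method names are invented for
illustration; no particular processor is claimed to work exactly this
way. It copies the mapping only when an element declares something, and
otherwise shares the parent's.

```python
class NamespaceContext:
    """Stack of prefix -> namespace scopes, one entry per open element."""

    def __init__(self):
        # The 'xml' prefix is bound by definition.
        self._scopes = [{"xml": "http://www.w3.org/XML/1998/namespace"}]

    def push(self, declarations):
        """Enter an element. `declarations` maps prefix ('' = default) to namespace."""
        if declarations:
            # New-on-change: copy the parent's mapping, then apply the changes.
            scope = dict(self._scopes[-1])
            for prefix, namespace in declarations.items():
                if namespace == "":
                    scope.pop(prefix, None)  # an empty value undeclares the prefix
                else:
                    scope[prefix] = namespace
        else:
            # No declarations: share the parent's mapping unchanged.
            scope = self._scopes[-1]
        self._scopes.append(scope)

    def pop(self):
        """Leave the current element."""
        self._scopes.pop()

    def resolve(self, prefix):
        """Return the namespace bound to `prefix`, or None if unbound."""
        return self._scopes[-1].get(prefix)


ctx = NamespaceContext()
ctx.push({"": "urn:a", "p": "urn:b"})   # <doc xmlns="urn:a" xmlns:p="urn:b">
ctx.push({})                            #   <child>
ctx.push({"": ""})                      #     <leaf xmlns="">
print(ctx.resolve(""), ctx.resolve("p"))  # None urn:b
```

Every application that wants to resolve a prefix on its own ends up
carrying something like this alongside the processor.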
Problem: failure of intuition. Have you ever tried to explain to
someone why the element <element name="name" ... /> is the same as the
element <xs:element name="name" ... />? And why the attributes are the
same? And why the attributes in <type name="name" /> <xs:type
name="name" /> are the same as each other, but not the same as the ones
in element? The most graceful explanation is to tell folks that
elements define a scope for the naming of their attributes, which
therefore don't require prefixes, but it always requires explanation.
Ever noted that qnames in content refer to elements, never (hardly
ever!) to attributes?
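
To make the puzzle concrete, here is a small check using Python's
standard xml.etree.ElementTree, which reports element and attribute
names in {namespace}name form. The two one-element documents are made
up for the purpose; only the XML Schema namespace is real.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

unprefixed = ET.fromstring(f'<element xmlns="{XS}" name="name"/>')
prefixed = ET.fromstring(f'<xs:element xmlns:xs="{XS}" name="name"/>')

# Both spellings produce the same expanded element name.
print(unprefixed.tag)                  # {http://www.w3.org/2001/XMLSchema}element
print(unprefixed.tag == prefixed.tag)  # True

# The unprefixed attribute is in no namespace in either document:
# the default namespace applies to elements, not to attributes.
print(list(unprefixed.attrib))               # ['name']
print(unprefixed.attrib == prefixed.attrib)  # True
```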
Problem: resistance to change. Despite the problems with namespaces,
the code is widespread. It won't be thrown away. And that's the
primary reason to *not* write this email (or for you to stop reading;
you really should, you know): the current status quo, however quirky,
is widely distributed (and if namespaces in XML aren't properly
supported in your favorite language/library, then you probably don't
care about the issues that the spec was designed to resolve).
But ... namespaces are important. The introduction of namespaces
provides potentially significant power: you can *combine vocabularies*
(or call them schemas or DTDs or whatever). Even browsers *could* make
use of this, potentially, mapping namespaces to plugins (except that
the XML implementation is so incredibly nasty that no browser actually
seems to do this).
More importantly, perhaps, namespaces provide a means of distribution
of authority for authoring schemas for different application areas or
industries, which are useful even in mono-namespaced documents.
Instead of W3C taking all the "good names" in the global XML namespace,
everybody can define their own, and use the names that they think good
inside it.
So ... what is a namespace?
Well, according to the Namespaces in XML specification, a namespace "is
a URI." That's utter horseshit, but it's what the spec says. It
doesn't actually have any of the characteristics of a URI, apart from
syntax, but the spec really, really wants to pretend that a URI is what
it is.
So, looking at the rest of the specification, what is it *really*?
Well, it's actually got two potentially different definitions: from the
point of view of a namespace creator/definer, and from the point of
view of a namespace user.
From the point of view of a namespace creator, a namespace is a
(reasonably) unique identifier with a low cost of entry, and
distributed authority. Now, the writers of the URI specification(s)
envisioned, and the writers of particular URI scheme specifications
detailed, a means of leveraging the distributed authority of the domain
name system for creating unique identifiers. DNS alone is inadequate,
unless one assumes that a domain will create only a single namespace
... but the power (and corresponding complexity) of URIs is probably
overkill. You need distributed authority, and a way of distinguishing
multiple work products within the aegis of a single recipient of
authority. That provides a reasonably stable namespace name.
From the point of view of an end-user, a namespace isn't a URI at all.
It's compared for string equality. A namespace is a label, a string,
an array of characters. Equality is all that matters. Any
single-character variation means a different namespace.
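
A tiny illustration of the gap, in Python. The example.org names are
placeholders. A URI-aware comparison would treat the host part as
case-insensitive; the namespace comparison does not care.

```python
from urllib.parse import urlsplit

a = "http://www.example.org/ns"
b = "http://WWW.EXAMPLE.ORG/ns"

# Read as URIs, the two name the same host (urlsplit lowercases it).
print(urlsplit(a).hostname == urlsplit(b).hostname)  # True

# Read as namespace names, they are simply different strings.
print(a == b)  # False
```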
We'll come back to this, but let's move a little further. Namespaces
are of interest because they enable XML vocabularies that mix multiple
namespaces. Key examples include XML Schema, XPath, XSLT (and XPath2,
XSLT2, XQuery). The key abstraction in vocabulary merging seems to be
the QName.
What, exactly, is a QName?
Well, it's defined to be a Qualified Name, but ... as with namespaces,
that's horseshit in the well water. Whatever it *really* is,
"qualified" is prolly not the ideal descriptor. Syntactically, a QName
is a combination of NCName:NCName. That's a no-colon-name, a colon,
and a no-colon-name. An obvious appellation would be "colon name", but
that's apt to give rise to unpleasant jokes (not to mention subject
lines), so should be avoided.
It can be described as an abbreviated name. The combination of the
expansion of the prefix (via mapping) with the local part of the name
generates a 'complete' name. Interestingly, though, the prefix (and
even the colon) can be missing, and still <element> is not the same as
<element>.
That's the problem with QNames: they have extremely *poor* locality.
Worse, they extend that poor locality from themselves to every NCName
in XML. You can't know the name of an XML element without looking up
its ancestor axis to generate the namespace context. So, while
Namespaces in XML allows you to embed foreign vocabularies, it makes it
a real challenge to *extract* namespace-well-formed fragments from
multi-namespace documents.
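
A short demonstration of the extraction problem, using Python's
standard xml.etree.ElementTree. The document and the org.example.*
namespace names are invented. The same characters mean one thing in
place and something else (or nothing) once cut out.

```python
import xml.etree.ElementTree as ET

document = (
    '<doc xmlns="org.example.outer" xmlns:h="org.example.html">'
    '<section><h:p>text</h:p></section>'
    '</doc>'
)

# In place, <section> takes its namespace from an ancestor.
in_context = ET.fromstring(document)[0].tag
print(in_context)  # {org.example.outer}section

# Cut out verbatim, the fragment no longer parses: prefix 'h' is unbound.
fragment = '<section><h:p>text</h:p></section>'
try:
    ET.fromstring(fragment)
    outcome = "parsed"
except ET.ParseError:
    outcome = "not namespace-well-formed"
print(outcome)  # not namespace-well-formed

# Even without the prefixed child, the element silently changes its name.
alone = ET.fromstring('<section/>').tag
print(alone)  # section
```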
Now, if you're reading xml-dev (and if you've continued this far, which
was a mistake that you should certainly correct as soon as possible; I
recommend beer, wine, or the distilled beverage of your choice), you've
seen all this rehearsed before, I know. And you've seen proposed
solutions. I'm not proposing much of a solution; I'm proposing
something more on the order of a profile, for schemas and instances,
that leaves the existing code infrastructure alone (for the time being).
How do we make namespaces in XML less egregious?
Well, first principle: avoid QNames. They demonstrate poor locality,
which makes processing portions of a document challenging. So ...
where are they used?
They're used in content, typically as references (they compete with IDs
and keys as references, mind, but they're also used to describe
'classes' or 'categories' of things). This is anathema, frankly. Any
XML dialect that is using QNames in content (which unfortunately
includes the most-adopted XML technologies: Schema, XPath, XSLT,
XQuery) is broken. It's broken because it fundamentally breaks
layering: namespaces (and consequently QNames) are in the XML processor
layer, while references are implicitly in the application layer.
Exposing namespaces to the application layer universally means that
every application, even those that don't care about namespaces, has to
cope. But namespace handling is unintuitive, and complex. Ugh.
QNames are also used as attribute names. This is, in fact, a necessary
use case, that cannot be worked around. There's an XML namespace, with
attributes in it; every processor can handle those. In order to put
foreign attributes on an element, you have to prefix them (it's the
only way to avoid the potential for name clashes). Foreign attributes
have to be distinguished from native ones; prefixes are the current
solution.
QNames are used for elements. This is simply unnecessary (note:
there's a conflict between namespaces-in-content and
namespaced-elements for Schema and XSLT, at least, but that's a
namespaces-in-content problem, in my opinion).
Some proposed best practices for a namespaces-light set of XML schemas
and instances: no QNames in content. No prefixed elements (change the
mapping of the default prefix: xmlns=). When foreign attributes are
used, define the prefix mapping in the same element (xmlns:prefix=
wherever prefix:name= appears).
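
Two of those three practices are mechanical enough to check. Below is
a minimal sketch in Python using the standard expat binding with
namespace processing left off, so names arrive exactly as written. The
function name and the sample documents are hypothetical, and it makes
no attempt to detect QNames in content, which no generic tool can
recognize anyway.

```python
import xml.parsers.expat


def profile_violations(xml_text):
    """List departures from the two checkable practices described above."""
    problems = []

    def start_element(name, attrs):
        # Practice: no prefixed elements.
        if ":" in name:
            problems.append(f"prefixed element <{name}>")
        # Practice: a foreign attribute's prefix is declared on the same element.
        for attr_name in attrs:
            prefix, colon, _local = attr_name.partition(":")
            if not colon or prefix in ("xmlns", "xml"):
                continue
            if f"xmlns:{prefix}" not in attrs:
                problems.append(
                    f"prefix '{prefix}' on <{name}> is not declared on that element"
                )

    parser = xml.parsers.expat.ParserCreate()  # no namespace processing
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return problems


good = '<doc xmlns="org.example.doc"><item xmlns:x="org.example.ext" x:flag="yes"/></doc>'
bad = ('<d:doc xmlns:d="org.example.doc" xmlns:x="org.example.ext">'
       '<d:item x:flag="yes"/></d:doc>')

print(profile_violations(good))      # []
print(len(profile_violations(bad)))  # 3
```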
But, if we've lost QNames in content, what do we do for references?
Next principle: if an application needs reference semantics, let it
define them itself. The application can then ensure that it never
acquires a chunk of XML that lacks the necessary context for resolution.
A simple implementation of this is to use "expanded" (or "jc") names
instead of QNames in content. This is the combination of namespace and
name into a single unit: {namespace}name (James Clark invented it, so
far as I am aware). For vocabularies in which references are uncommon,
it's a simple, if somewhat cumbersome solution (XML Schema would groan
under the weight, but it makes very extensive use of references by
QName).
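
For the record, handling the {namespace}name form takes very little
code. A minimal sketch in Python; the function names are made up, and
it assumes the namespace name itself contains no '}' character.

```python
def to_expanded(namespace, local):
    """Combine a namespace name and a local name into {namespace}local."""
    return f"{{{namespace}}}{local}" if namespace else local


def from_expanded(name):
    """Split {namespace}local into (namespace, local); '' means no namespace."""
    if not name.startswith("{"):
        return "", name
    namespace, closing, local = name[1:].partition("}")
    if not closing or not local:
        raise ValueError(f"not an expanded name: {name!r}")
    return namespace, local


name = to_expanded("http://www.w3.org/2001/XMLSchema", "element")
print(name)                 # {http://www.w3.org/2001/XMLSchema}element
print(from_expanded(name))  # ('http://www.w3.org/2001/XMLSchema', 'element')
```

No context is needed to read the name back: everything required is in
the string itself.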
If expanded names seem too cumbersome for the frequency of reference,
then define an *application-level* abbreviation or mapping syntax.
Nobody else needs to understand it, after all; it's not going to be
built into your XML processor. If it's an application responsibility,
define it at the application level.
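
One possible shape for such an application-level mapping, sketched in
Python. The alias table, the alias:local syntax, and the function name
are all assumptions made for the example; the point is only that the
table belongs to the application and does not depend on whatever xmlns
declarations happen to be in scope where the reference appears.

```python
# Alias table owned by the application (hypothetical contents).
ALIASES = {
    "xs": "http://www.w3.org/2001/XMLSchema",
    "html": "http://www.w3.org/1999/xhtml",
}


def expand_reference(reference, aliases=ALIASES):
    """Turn an application-level 'alias:local' reference into {namespace}local."""
    alias, colon, local = reference.partition(":")
    if not colon:
        return reference  # no alias: a name in no namespace
    if alias not in aliases:
        raise ValueError(f"unknown alias {alias!r} in reference {reference!r}")
    return f"{{{aliases[alias]}}}{local}"


print(expand_reference("xs:string"))  # {http://www.w3.org/2001/XMLSchema}string
print(expand_reference("title"))      # title
```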
And a final principle: dump the bogus URIs, and use the simplest
namespaces possible.
In namespaces, the 'scheme' portion of a URI provides no information.
Drop it. Alternatively: provide a public example of two XML namespaces
differentiated by scheme.
That leaves domain + path + query | fragment (for common URIs; it's
different for URNs or for the mailto: scheme and there are other corner
cases you can generate, I'm sure, but we're going to focus here on
domain-based namespace URIs with some flexibility) (or you can provide
examples of other things; and do note that urn.[urn-pattern] is a
perfectly reasonable 'domain' replacement that won't collide with
DNS). Query and fragment are not generally used, so let's drop them as
well (or you can provide an example of two public XML namespaces
differentiated by query or fragment).
So, domain + path. That can be simplified to reverse-domain naming,
with extension, if you care to. Certainly easier, and it's commonly
encountered in a number of programming languages. It has the further
advantage of removing the nasty non-Name characters, so you could do a
form of expanded name as namespace:name rather than {namespace}name.
The strongest objection to this would be an example of two public XML
namespaces differentiated by identical fragments in the machine name
and initial portion of the path, such that this change would make the
two identical.
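
A sketch of that simplification in Python, for domain-based namespace
URIs only. The function name is invented, and the exact output format
(dots between every segment) is one assumption among several possible.
The last two lines show the kind of collision described above, using
made-up example.org names.

```python
from urllib.parse import urlsplit


def reverse_domain_name(uri):
    """Drop scheme, query and fragment; reverse the host; append the path."""
    parts = urlsplit(uri)
    if not parts.hostname:
        raise ValueError(f"no domain in {uri!r}")
    labels = parts.hostname.split(".")[::-1]            # www.w3.org -> org, w3, www
    segments = [s for s in parts.path.split("/") if s]  # ignore empty path segments
    return ".".join(labels + segments)


print(reverse_domain_name("http://www.w3.org/2001/XMLSchema"))
# org.w3.www.2001.XMLSchema

# The objection: host and path fragments that flatten to the same string.
print(reverse_domain_name("http://a.example.org/b"))   # org.example.a.b
print(reverse_domain_name("http://example.org/a/b"))   # org.example.a.b
```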
Mind you, the above is suggested best practice. No current XML
processor will choke on a URI without a scheme, or with
extended-reverse-domain naming, and the unique strings compared by
users become easier to read and comprehend, with no loss of
distinguishing information. But since we're not proposing a change to
processors at the moment, the most baroquely ugly and evil URIs remain
permissible.
An alternative: define an "xmlns" 'scheme' for URIs (that identify
namespaces, not pretending to identify any other sort of resource),
with a simplified pattern as above. This is slightly more problematic
for the longer term, because a QName has one colon, not two. But such
a scheme definition could otherwise restrict itself to characters
permitted by the XML Name production.
Summarizing: we could, at present, start using simplified domain-based
non-URIs for namespaces, avoid QNames in content (and replace with
application-level mapping as needed), use default-prefix mapping for
elements, and only use non-default prefix mapping
(xmlns:prefix="namespace") for attributes.
This doesn't get us a long way forward, but it avoids some problems, and
opens the possibility, if enough people decided to do this, that some
future cleanups to Namespaces in XML (along these lines, that is) could
be implemented.
There are some obvious obstacles. W3C XML Schema is one. It makes
very extensive use of QNames in content, for reference; it commonly
binds the default prefix to the target namespace so that those
references need not be prefixed, in content, which means that the
structure of the schema (the elements) must be prefixed. This is hard
to fix, although it would be straightforward enough for a future
version of schema to drop the use of QNames and adopt a
schema-parser/validator-level mapping instead. But it must be
acknowledged that Schema's use of QNames in content is one of the
primary obstacles to making any change to the status quo.
XPath (and XSLT and various other things, like XPath2 and XSLT2 and
XQuery) are probably easier. XPath currently delegates mapping to its
host language, which means that a host language revision or variant
could use application-level mapping instead of breaking layers by using
XML processor-level mapping. However, a variant of XPath is perfectly
feasible, in which the expressions are enhanced with the inclusion of
namespaces, using the JC expanded-name form. Instead of
//xs:schema/xs:annotation//html:p :
//{org.w3c.xml.ns.schema.2.1}schema/annotation//{org.w3c.html.5.1}p
(with an apology for the version numbers). XPath could define the
context of a namespace declaration (represented by the {namespace}
particle prior to a name) as either 'descendant' or
'following-in-expression' (some investigation would show which would be
preferred; descendant seems likely, unless there's more use of
non-descendant axes than I've encountered). Note that this would not
include foreign attributes, which would appear as @{namespace}name.
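
As it happens, the limited path syntax in Python's
xml.etree.ElementTree already accepts {namespace}name steps, though it
wants the namespace on every step. The sketch below adds the
'following-in-expression' reading on top of that by rewriting the path
first. The function name, the document, and the org.example.*
namespaces are invented; it assumes namespace names without '/' and
ignores predicates, so take it as a toy and not as a proposal for the
real grammar.

```python
import re
import xml.etree.ElementTree as ET


def expand_path(path):
    """Carry each {namespace} forward to the unqualified steps that follow it."""
    current = ""
    out = []
    for token in re.split(r"(/+)", path):
        if token.startswith("{"):
            # A step that declares a namespace: remember it for later steps.
            current = token[: token.index("}") + 1]
        elif token and token[0] not in "/.*@" and current:
            # A plain element step inherits the most recent namespace.
            # Separators, '.', '..', '*' and attribute steps are left alone.
            token = current + token
        out.append(token)
    return "".join(out)


doc = ET.fromstring(
    '<bundle>'
    '<schema xmlns="org.example.schema"><annotation><documentation>'
    '<p xmlns="org.example.html">hello</p>'
    '</documentation></annotation></schema>'
    '</bundle>'
)

expanded = expand_path("//{org.example.schema}schema/annotation//{org.example.html}p")
print(expanded)
# //{org.example.schema}schema/{org.example.schema}annotation//{org.example.html}p

# ElementTree paths are relative to an element, hence the leading '.'.
found = doc.findall("." + expanded)
print([p.text for p in found])  # ['hello']
```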
But what's the point? Sure, I can argue that it's best practice to do
some of this stuff, but we've already seen multiple namespace-fixup
proposals die in flames.
Eh. :-) I think that these are best practices, and can be adopted by
folks now, without checking with other people. If they were adopted,
then some of the possible solutions outlined in the final 'problems'
section might see some traction (but nobody's gonna bother doing
expanded names in *major* application dialects unless they've seen
other folks adopting expanded names elsewhere). They probably make
your XML cleaner, more understandable, and work with current processors.
And given enough folks adopting something like this set of practices
(especially with regard to QNames: remove them from content, use
default prefix only for elements, always declare the prefix in the same
element that contains foreign attributes), processors could start to
consider optimization (or, more bluntly: put the crufty namespace code
off in a de-optimized branch that's only invoked when the simple (and
faster) best-practice form isn't working).
Given enough adoption of namespace-simplification (to
extended-reverse-domain, or something equivalent), then a new set of
revisions of core specs might acknowledge that, as well, and might even
permit use of un-mapped names (actual "qualified" names in this case:
org.w3c.xml.ns.schema.3.0.element, for example). And perhaps even move
to a central registry for widely-used vocabularies
(org.w3c.xml.ns.schema.42.0 == xs ?).
What, are you *still* reading? It's the weekend! Go do something fun!
Amy!
--
Amelia A. Lewis amyzing {at} talsever.com
Merchant, street girl, beggar, yeoman,
king or common, man or woman,
only two things make us human--
sorrow and love, sorrow and love ....
-- The Last Song of Sirit Byar