OASIS Mailing List Archives





   Re: [xml-dev] Normalizing XML [was: XML information modeling best practi


Ronald Bourret wrote:

> This is the messaging view of XML documents. It is probably true when the XML
> document is created from an XML-enabled database.
> However, the flexibility of XML means that it is not always true. XML
> documents used to store semi-structured data correspond more closely related
> to rows in a table and the design of the documents corresponds exactly to the
> design of the database. In this case, you could view the XML document as a
> transaction, but could also view it simply as the data
> and inserting it into a database as the transaction.
> There are also XML documents that don't fit the transaction view at all: XSLT
> documents, XML-RPC documents, etc.

Actually, they do--by virtue of being, as you say, documents. Put another way,
at the level at which they are XML--that is, documents--they can (and at that
level must) be processed as text, as lexical entity bodies, regardless of the
semantics which might be assigned to their syntax as XSLT, XML-RPC, etc. This
is precisely the same argument we just saw in the 'XInclude where I bloody want
to' thread. In that thread Uche asked 'if you construct a source document that
uses elements in the namespace and with the name reserved by the XInclude
specification, why on earth would you blame a processor for acting on those
instructions?' The answer is that you wouldn't blame an *XInclude* processor
for doing that, but you might well blame an *XSLT* processor for it. At some
level, if a processor of any sort is to operate upon XML--as XML--then it must
operate on the text as bare syntax, just as Elliotte expects an XSLT processor
to act upon xi:* namespaced nodes, without elaborating from the syntax the
semantics of the includes. Likewise here, XML documents--whether their
semantics are understood by specialized processors to be XSLT, XML-RPC, or
whatever--are, at the level at which they are XML documents, processable by an
appropriate database engine as *data* transactions.

The database engine appropriate for that is one which handles XML--which is to
say marked-up text--as the object of basic CRUD transactions. Such a database
engine must follow the markup of the document where it leads, provided only
that the document itself is well-formed XML. For example, if that engine is
asked to commit a document, then the 'class' of that document is as specified
by the GI and namespace of its root element, which is to say its fully
qualified type in the markup sense. That database engine may not complain if it
is then presented with another document of the same class--which is to say one
which presents in its markup the same fully qualified type--but which exhibits
an entirely different structure beneath the root element. According to its
markup, and therefore as XML, that second document quite simply is another
instance of the same class as the first.
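The classification-by-root-element idea above can be sketched in a few lines. This is a minimal illustration, not any particular product's API: the function names and sample documents are hypothetical, and it assumes Python's standard `xml.etree.ElementTree`, which exposes a namespaced root tag in `{namespace}local-name` form.

```python
# Sketch: an XML database engine classifying a document solely by the fully
# qualified type (namespace + generic identifier) of its root element.
# All names and sample documents here are hypothetical.
import xml.etree.ElementTree as ET

def document_class(xml_text):
    """Return the (namespace, local-name) pair of the root element."""
    root = ET.fromstring(xml_text)          # raises ParseError if not well-formed
    tag = root.tag
    if tag.startswith("{"):                  # ElementTree's {ns}local form
        ns, local = tag[1:].split("}", 1)
    else:
        ns, local = "", tag
    return ns, local

doc1 = '<inv:invoice xmlns:inv="urn:example:inv"><total>9</total></inv:invoice>'
doc2 = '<inv:invoice xmlns:inv="urn:example:inv"><line n="1"/><line n="2"/></inv:invoice>'

# Entirely different structure beneath the root, yet -- as markup -- both are
# instances of the same class, and the engine may not complain.
print(document_class(doc1) == document_class(doc2))  # True
```

The point the code makes concrete: nothing below the root element enters into the classification, so structural divergence between two documents of the same class is not an error.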

The point is that just as relational database engines operate on records, often
composed through joins of multiple relations, XML database engines should
operate on documents, composed of elements and attributes. It is a premise of
the relational concept that the rows of a given relation are structurally
identical and, by extension, that complex records composed through joins of those
relations are too. That is not only not a premise of XML, but the peculiarly
XML concept of simple well formedness means that there is no expectation that
documents declared as of a given class by their root element will therefore
exhibit the same, or even similar, structure.

So if what we are talking about here is the database handling of XML as XML,
the most important consideration must be the markup. Normalization within a
document, or within the elements of that document is simply alien to the rules
by which XML is structured. You really cannot speak of 'XML documents used to
store semi-structured data', let alone conclude that such documents 'correspond
more closely related to rows in a table and the design of the documents
corresponds exactly to the design of the database'. No matter how many such
documents of a given class you see you cannot presume that any future document
of that class will exhibit the same structure, because that is a constraint
which XML does not impose. Of course, you could limit your database engine to
processing only documents which exhibit a particular structure and content
model, but under such constraints it is no longer an XML database engine, but
only a database engine for a particularly limited class of documents. Of
course, you will have to limit your database engine to exactly such constraints
of working only on a few carefully predefined document types if you want that
engine to read particular semantics from the syntax of a document and process
it in accordance with those semantics, rather than as simple well formed XML
text. I would ask that you please not call that an XML database engine.
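The contrast drawn here, between an engine whose only admission test is well-formedness and one constrained to a predefined structure, can be made concrete. A hedged sketch, again with hypothetical names and using the standard `xml.etree.ElementTree`:

```python
# Sketch contrasting the two engines described above (names hypothetical).
import xml.etree.ElementTree as ET

def xml_engine_accepts(xml_text):
    """An XML database engine in the post's sense: well-formedness is the
    only admission test."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def constrained_engine_accepts(xml_text, required_children):
    """A structure-constrained engine: it additionally demands a fixed
    sequence of child elements, and so handles only one limited class of
    documents -- no longer an XML database engine in the post's sense."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return [child.tag for child in root] == required_children
```

The first function admits any well-formed document whatever its shape; the second rejects a perfectly well-formed document merely for departing from an expected content model.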

A true XML database engine, however, can operate very much as Hugh Chatfield
describes. A document (journal?) is submitted for commit (posting?) to the
larger database of such documents (ledger?) maintained and manipulated by the
database engine. Each of the simple CRUD operations consists primarily of this
commit, and there is very little difference among those operations except in
how cascading changes resulting from the data transaction must be carried out.
The principal effect of the commit is to set the current or most-recent-version
value of the elements (in their fully qualified form) present in the document
committed. Beyond that, specific semantics must in fact be elaborated by custom
processing which recognizes particular syntax. Think of this as database
triggers. Where no trigger processes exist for the specific syntax committed,
either no processing is done or an error can be raised. That is, however, not
an error in the performance of the database engine, which in committing the
document has done exactly what it was designed for. It is an error in the
comprehensiveness of the processing provided for the data actually encountered.
The traditional solution in such cases would be to segregate the unexpected
document, bring a human into the loop, and design appropriate processing for
dealing with the new circumstances.
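The commit-trigger-quarantine flow just described can be sketched as follows. This is one possible reading of the design, not Chatfield's or anyone's actual implementation; all class and method names are hypothetical, and the "trigger" is simply a registered callable keyed by root-element type.

```python
# Sketch of the ledger described above: commit sets the most-recent-version
# value per document class; triggers elaborate any further semantics; documents
# with no matching trigger are segregated for human review. Names hypothetical.
import xml.etree.ElementTree as ET

class XmlLedger:
    def __init__(self):
        self.current = {}      # root tag -> most recent committed document
        self.triggers = {}     # root tag -> callable elaborating semantics
        self.quarantine = []   # unexpected documents, held for a human

    def register_trigger(self, root_tag, fn):
        self.triggers[root_tag] = fn

    def commit(self, xml_text):
        root = ET.fromstring(xml_text)     # well-formedness is the only demand
        self.current[root.tag] = xml_text  # the commit itself always succeeds
        trigger = self.triggers.get(root.tag)
        if trigger is not None:
            trigger(root)                  # custom processing for known syntax
        else:
            self.quarantine.append(xml_text)  # segregate; bring a human in

ledger = XmlLedger()
seen = []
ledger.register_trigger("order", lambda root: seen.append(root.findtext("id")))
ledger.commit("<order><id>42</id></order>")   # known syntax: trigger fires
ledger.commit("<memo>hello</memo>")           # unknown syntax: quarantined
```

Note that the second commit is not an error in the engine's performance: the memo is stored as the current document of its class, and only the absence of a trigger routes it to quarantine.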


Walter Perry



Copyright 2001 XML.org. This site is hosted by OASIS