Re: [xml-dev] XML Schema as a data modeling tool

On 01/10/2013, at 7:54 PM, Stephen Cameron <steve.cameron.62@gmail.com> wrote:

To me it does come down to a question of cost vs flexibility.

The hierarchical approach has proven a suitable basis for the web (a network of tree 'documents') and has also remained important in some specific database scenarios, where the data can be modelled 'adequately' that way and performance is critical, specifically medical records and financial accounts. I read that the hierarchical MUMPS database/language is still important in these scenarios. It's also the underlying foundation of several significant object database products: InterSystems Cache (being used for a European Space Agency star-mapping project) and, interestingly, EyeDB, which was originally IDB and was developed for one of the initial human genome projects.

Hierarchical databases are very performant, provided you just traverse the hierarchies. This seems to me to offer very significant cost benefits, both in the effort of creating such a db (yes, the data-model is relatively intuitive), and in getting the information from the user to the database and back again (e.g. XRX).

Hierarchical databases lack flexibility in queries and that is where relational databases have their strength, and also in preventing update anomalies from occurring. But doing lots of joins is expensive and so people have abandoned relational databases for very large datastores and gone to the 'document' (JSON) databases of the NoSQL movement (take your pick of the options there).

My personal attitude to relational databases was dented when a scientist in a meeting described my lovingly crafted ER diagram as "another horrendogram". Pretty soon after that I discovered XML databases and XForms and I became interested in XML Schema as a 'hierarchical data-model', not just a means of validating XML documents. To my mind such data-models must sit in the middle of everything else (go XForms!!)

The 'Domain Driven Design' school of Eric Evans places 'clarification of terminology' at its centre in order to build valid models, or even to decide if more than one model of the same things is needed for different groups.

As Hans said, hierarchical diagrams can be well understood by most people; ER diagrams and UML, not so. I spent most of a year at my last job trying to convince colleagues of the validity of this point of view. Towards the end I'd half convinced some scientists; the IT folk weren't that interested, sadly. XML is virtually unused in science, other than HTML, I feel.

It seems to me that the navigational capabilities of XQuery via XPath actually go a long way towards overcoming this querying limitation of hierarchical databases, and given that XQuery is integrated with indexes in XML databases, it becomes very powerful, I think.
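A minimal sketch of this kind of path-based navigation, using Python's standard xml.etree.ElementTree (which supports only a small XPath subset, not XQuery, and has no indexes). The patient record and all element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record; the names are invented for illustration.
RECORD = """
<patient id="p1">
  <name>Jane Doe</name>
  <visit date="2013-01-10">
    <diagnosis code="J10">Influenza</diagnosis>
  </visit>
  <visit date="2013-06-02">
    <diagnosis code="S52">Forearm fracture</diagnosis>
    <diagnosis code="J10">Influenza</diagnosis>
  </visit>
</patient>
"""

root = ET.fromstring(RECORD)

# Following the hierarchy: every diagnosis code under every visit.
all_codes = [d.get("code") for d in root.findall("./visit/diagnosis")]

# Asking a question "against the grain" of the hierarchy:
# which visits recorded a diagnosis with code J10?
flu_visit_dates = [
    visit.get("date")
    for visit in root.findall("./visit")
    if visit.find("diagnosis[@code='J10']") is not None
]

print(all_codes)
print(flu_visit_dates)
```

The second query starts from a property of a leaf (the diagnosis code) and works back to its container, which is the sort of question a purely top-down traversal does not answer directly.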

In terms of the limitations of XML Schema for modelling, to me these are overcome by thinking about the nature of relationships. We can think initially about everything being separate, as in an OO model, with references connecting things together; this is the graph approach. But then you can logically 'aggregate' some child things within a parent thing: they are 'owned' and don't have any logical reason to be kept separate (e.g. the medical record and bank account examples).

But what if there is more than one owner, i.e. a many-to-many relationship? My proposed solution: qualify the relationship (parent/non-exclusive-ownership/co-owner=IDREF).
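One possible reading of this proposal, sketched in Python with xml.etree.ElementTree: the account stays nested under one owning parent, and further owners are attached through a reference to their id. The document structure and the names (bank, person, account, coOwner, ref) are assumptions made for illustration, and the id lookup is done by hand rather than by schema-aware ID/IDREF validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: account a1 is nested under Ann (its parent),
# and Bob is attached as a co-owner by reference to his id.
DOC = """
<bank>
  <person id="p1">
    <name>Ann</name>
    <account id="a1">
      <coOwner ref="p2"/>
    </account>
  </person>
  <person id="p2">
    <name>Bob</name>
  </person>
</bank>
"""

root = ET.fromstring(DOC)

# Index every element that carries an id, playing the role of ID lookup.
by_id = {e.get("id"): e for e in root.iter() if e.get("id")}


def owners(account_id):
    """Return the names of the containing owner and all referenced co-owners."""
    names = []
    for person in root.findall("person"):
        for account in person.findall("account"):
            if account.get("id") != account_id:
                continue
            names.append(person.findtext("name"))  # owner by containment
            for co in account.findall("coOwner"):  # owners by reference
                names.append(by_id[co.get("ref")].findtext("name"))
    return names


print(owners("a1"))
```

The containment still gives the simple tree view of Ann's data, while the qualified reference records that the ownership is not exclusive.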

One big issue in relational databases is code list tables. These are kept in separate tables for normalisation; I wonder whether in a hierarchical system this must be completely ignored. Storing a text-string code instead of a numeric foreign key takes more memory, but I think that this is not now very important. Update anomalies can be an issue, but only if the system is dumb enough to allow them to happen (the client needs to know the schema as well as the server). Again it's a question of the nature of the relationship: when is something to be a code list member?

Anyway, it's good that this discussion is occurring. I suspect this is all old news in the web standards world, but it is now starting to become more relevant to databases, with the 'hegemony' of the relational model being challenged.

Hope this is of interest; it's been a big question in my mind for a while now.

Steve Cameron

On Tue, Oct 1, 2013 at 7:23 PM, Michael Kay <mike@saxonica.com> wrote:

Are we talking about a shopping cart with wheels, or one that exists only in an online shopping application?

If we're talking about the latter, then we're not talking about modelling the real world, we are talking about designing an electronic virtual world. The two tasks have similarities, but they are not at all the same.

But either way, I wouldn't go anywhere near concepts such as "table rows" to do the modelling.

(Incidentally, one thing that people often forget about the Codasyl or network database model is that the records were - at least in principle - hierarchic rather than flat. So this was indeed a network of trees. But to call it a forest would be very misleading, because the trees in a forest have no relationships to each other, and it's these relationships that are so important.)

Michael Kay
Saxonica

On 1 Oct 2013, at 09:22, Hans-Juergen Rennau wrote:

May I repeat what I take to be a main point? There is no debate about complex systems consisting of entities which cannot be brought into a single hierarchy on a durable basis. So the system as a whole is a "network" - yes.

The question I have is about the entities of which the network is composed - what is their granularity? What is our sub model for those participants? I maintain that there are situations (modelling tasks) when the participants of the network are sufficiently complex to be represented by trees, rather than table rows or flat objects. Then the network is a network of trees.

Doing so promises a dramatic reduction of complexity. A touristic shopping cart conceived as a set of 100 "networked" table rows makes no sense; conceived as a tree it makes sense. The shopping cart itself is a participant in an enterprise model which is a large network comprising many other entity types. Many are also quite complex, and whether or not we choose to regard such entities as trees (and thus the network as a forest) makes a difference.

For example, after looking for five minutes at a concise tree representation, you usually have quite a good grasp of what is in and what is not, even if the tree has 500 items. Suppose you have this question: which tour operator related data are contained in the concept "shopping cart"? Provided the naming of elements and attributes has been done carefully, you can answer the question within one minute, and you are confident that you have given the complete answer.
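A small sketch of how such a question could be answered mechanically from a tree representation, assuming a hypothetical shopping cart structure (the element names below are invented, not taken from any real schema). It lists every element path and filters for the ones that mention the tour operator.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton of a "shopping cart" tree; names are illustrative only.
CART = """
<shoppingCart>
  <customer><name/></customer>
  <item>
    <hotel><name/><tourOperator><code/><name/></tourOperator></hotel>
  </item>
  <item>
    <flight><carrier/><tourOperator><code/></tourOperator></flight>
  </item>
</shoppingCart>
"""


def paths(elem, prefix=""):
    """Yield the path of elem and of every element below it."""
    path = f"{prefix}/{elem.tag}"
    yield path
    for child in elem:
        yield from paths(child, path)


root = ET.fromstring(CART)

# Collapse repeated paths so that each kind of item is listed once.
all_paths = sorted(set(paths(root)))
operator_paths = [p for p in all_paths if "tourOperator" in p]

for p in operator_paths:
    print(p)
```

Because every item has exactly one place in the tree, the filtered list is a complete answer, which is the confidence referred to above.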

So, wrapping up: I am not talking about replacing a network with a tree, but about letting the network be a network of trees.

Kind regards,
Hans-Juergen

From: Damian Morris <damian@moso.com.au>
To: Michael Kay <mike@saxonica.com>; David Lee <dlee@calldei.com>; Hans-Juergen Rennau <hrennau@yahoo.de>; William Velasquez <wvelasquez@visiontecnologica.com>
CC: xml-dev@lists.xml.org
Sent: 9:41 Tuesday, 1 October 2013
Subject: Re: [xml-dev] XML Schema as a data modeling tool

First of all, +1 for everything Michael has said on this subject. (Although, you can usually take that as a given.)

Perhaps another way of looking at what Michael is saying is that it is the difference between "X contains a Y" and "X has a Y" - for example, an Address can be considered to "contain" a Street - the Street is intrinsic to the Address; however, a Person "has" an Address - there is a relationship there, but the former does not contain the latter in any meaningful sense.

Hierarchical models better suit "contains" relationships, but the real world is full of "has" relationships, which are more naturally described by network models.
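The distinction can be sketched with plain Python dataclasses, under the assumption that "contains" maps to an embedded value and "has" maps to a reference by key. The class names follow the example above; the field names and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Address:
    address_id: str
    street: str  # "contains": the street is part of the address itself
    city: str


@dataclass
class Person:
    name: str
    # "has": the person only refers to addresses, which exist independently
    address_ids: List[str] = field(default_factory=list)


# Addresses live in their own collection; people point into it.
addresses: Dict[str, Address] = {"a1": Address("a1", "1 Main St", "Hobart")}
ann = Person("Ann", ["a1"])
bob = Person("Bob", ["a1"])

# One change to the shared address is seen through both people.
addresses["a1"].street = "2 Main St"
streets = [addresses[i].street for p in (ann, bob) for i in p.address_ids]
print(streets)
```

Embedding the street inside Address is the tree-shaped part; the shared address_ids are the network-shaped part.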

Cheers,

Damian