Re: [xml-dev] Before creating a syntax, create an underlying data model

Hi,

With a database background, my perspective is that:

1. A data model is needed as soon as you want to validate (don't forget the PSVI :-) ), query, or process data.

2. A database needs a data model that the query language can manipulate, independently of how the data is physically stored (this is what Edgar Codd called data independence, a very important concept).

3. Syntax is a representation of the data model that can be used for interoperability and exchanging (importing, exporting) data between different vendors. I tend to see the data model as the upper layer and the syntax as the lower layer, maybe because I think of it in terms of abstraction layers.

4. Whether the syntax or the data model should be designed first is more a question of best practice and, as was said in the other thread, the order in which it was done for XML has historical reasons. In practice, an instance of the data model can be serialized to syntax, and syntax can be parsed to an instance of the data model.

5. There is this same idea of syntax and data model for other data formats as well:

- XML and JSON (and Protocol Buffers, BSON, etc., to the extent that bits can be seen as a low level of syntax, too) are the syntaxes that correspond to document stores/hierarchical data.

- CSV is the syntax that corresponds to relational databases and the relational model, queried with SQL.

- XBRL is the syntax that corresponds to data cubes (OLAP, queried with MDX...) (I'm doing a bit of a stretch here, but that's how I see it).

- RDF (or, to be more precise, RDF/XML, Turtle, etc.) is the syntax that corresponds to graph databases and triple stores, queried with SPARQL.
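
To make point 4 a bit more concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The to_model helper and the two sample documents are made up for this illustration, and the "data model" here is a deliberately simplified one (tag, attributes, text, children), not the Infoset or the XDM. It only shows that two different pieces of syntax can parse to the same instance, and that serializing and re-parsing gives back that instance.

```python
import xml.etree.ElementTree as ET


def to_model(element):
    """Reduce a parsed element to a plain tuple: (tag, attributes, text, children).

    Hypothetical, simplified data model for illustration only: attribute
    order and quote style are dropped, and tail text is ignored.
    """
    return (
        element.tag,
        tuple(sorted(element.attrib.items())),
        (element.text or "").strip(),
        tuple(to_model(child) for child in element),
    )


# Two different character sequences (syntax level) ...
doc_a = '<book id="1" lang="en"><title>XDM</title></book>'
doc_b = "<book lang='en'   id='1' ><title>XDM</title></book>"

# ... parsed into instances of the (simplified) data model.
model_a = to_model(ET.fromstring(doc_a))
model_b = to_model(ET.fromstring(doc_b))
same_model = model_a == model_b

# Serialize the parsed tree back to syntax, then parse it again.
serialized = ET.tostring(ET.fromstring(doc_a), encoding="unicode")
round_tripped = to_model(ET.fromstring(serialized))

print("same syntax:", doc_a == doc_b)
print("same model: ", same_model)
print("round trip preserves model:", round_tripped == model_a)
```

The comparison is made on the model, not on the strings, which is the sense in which I see the data model as the upper layer.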

My 2 cents :-)

Kind regards,

Ghislain

On Sat, Apr 16, 2016 at 9:08 PM, Costello, Roger L. <costello@mitre.org> wrote:

Hi Folks,

Michael Kay says (paraphrasing):

            It is unwise to define a syntax without
            an underlying data model.

I hereby take this as best practice:

            Every syntax must have an underlying data model.

That is an incredibly important statement, with huge ramifications.

Let’s begin with the obvious question:

What is a data model?

[Definition] Quintessential: representing the most perfect or typical example of a quality or class.

Arguably, the XQuery and XPath Data Model (XDM) is the quintessential data model. Let’s see how it specifies its data model.

The XDM specification creates and defines terms, specifies how one should view the data, and specifies operations on the data and the results of those operations. Phew! What does all that mean? I think it will become clearer with an example.

Example: The XDM specification introduces and defines the term “sequence.” This term is unique to the data model; there is no such concept in XML. In the XDM view of the world, every value is a sequence of zero or more items. Here are some of the operations that can be performed on a sequence: select the first item, select the last item, select the nth item, and so forth.
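
The operations just listed can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any XPath processor is implemented: a sequence is modeled as a Python tuple, the function names item_at, first, and last are hypothetical, and positions are 1-based as in XPath. Each selection returns a sequence of zero or one items, so asking for a position that does not exist yields the empty sequence rather than an error.

```python
def item_at(seq, n):
    """Select the nth item (1-based) as a sequence of zero or one items."""
    if n < 1:
        return ()
    # Slicing never raises: an out-of-range position gives the empty tuple.
    return seq[n - 1:n]


def first(seq):
    """Select the first item of a sequence."""
    return item_at(seq, 1)


def last(seq):
    """Select the last item of a sequence."""
    return item_at(seq, len(seq))


# Hypothetical sample sequence of three items.
items = ("a", "b", "c")

print(first(items))       # first item
print(last(items))        # last item
print(item_at(items, 2))  # second item
print(item_at(items, 4))  # out of range: empty sequence
```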

Let’s recap: the XDM specifies a way to view the data (as a sequence of items), it defines terms, and it specifies operations on values. That’s what a data model is.

Here are two relevant quotes from the XDM specification:

            The XQuery 1.0 and XPath 2.0 Data Model
            (henceforth "data model") serves two purposes.
            First, it defines the information contained in the
            input to an XSLT or XQuery processor. Second,
            it defines all permissible values of expressions in
            the XSLT, XQuery, and XPath languages.

            The XQuery 1.0 and XPath 2.0 Data Model specifies
            what information in the documents is accessible, but
            it does not specify the programming-language interfaces
            or bindings used to represent or access the data.

The next question to be addressed is this:

            When should a data model be created?

Answer: before you define the syntax.

XML did it wrong. Ditto for Namespaces. They created a data model (the Infoset) after they had already defined the syntax. Bad, bad, bad.

Michael Kay gave a fascinating and illuminating description of how XML and the Infoset were created in the wrong order:

            The Infoset spec refers to the XML spec,
            but not the other way around. In terms of
            layering, the Infoset is an overlay on top
            of XML, not an underpinning. So it's not
            an "underlying" data model, rather an
            "overlying" one.

            Yes, the Infoset can be taken as the data model
            for XML. But it is an after-the-event rationalisation;
            it did not influence the design of XML. It also
            came too late to influence other specifications;
            for example in the XPath data model, namespace
            nodes have parents, whereas in the Infoset, namespace
            information items do not.

Wow! That is mind-blowing.

[Why did this happen?]

Lessons Learned:

1. Create a data model that underlies your syntax.
2. Create the data model before creating your syntax.
3. In the data model specify the view of the data, terminology, and operations on the data.

At the outset I stated that there are huge ramifications. Now it’s time to see one of them:

            An XML Schema defines a new syntax so
            before you create an XML Schema, create
            a data model.

Any additions, changes, comments?

/Roger