xml-dev - Re: [xml-dev] Datatypes - it's in the contract


To: Mike Champion <mc@xegesis.org>,xml-dev@lists.xml.org
Subject: Re: [xml-dev] Datatypes - it's in the contract
From: Jonathan Robie <jonathan.robie@datadirect-technologies.com>
Date: Mon, 30 Sep 2002 07:10:35 -0400
In-reply-to: <MI3FBUR97NI86OKSQHCE9VSE9WT2VOK.3d9610f6@MChamp>
References: <5.1.0.14.0.20020928074203.022ebec8@ncmail.datadirect-technologies.com>

At 04:28 PM 9/28/2002 -0400, Mike Champion wrote:
>9/28/2002 7:50:50 AM, Jonathan Robie 
><jonathan.robie@datadirect-technologies.com> wrote:
>
> >If the applications that use this data require data of the appropriate
> >type, and we want validation to be able to determine whether the contract
> >is being followed, then we have to allow data types to be declared.
>
>I think there are a number of problems with taking such types all that
>seriously for real *XML*-centric applications, even accepting the
>(pretty reasonable!) argument that the schema should define a contract
>between producers and consumers of data.   (I wouldn't quarrel with using
>types extensively in OO programming languages, nor in exploiting SQL types
>in SQL-centric programs; I simply think that XML has other use cases and
>design patterns than these technologies.  Disagree?  That's another thread!)

I certainly agree with you - after all, I led the development of an 
SGML/XML repository for editorial environments in 1995-1996, have spoken at 
two conferences on querying ancient Greek manuscripts, and participated in 
the first demo for HL7 patient records in XML. I know there are people who 
think that XML is only a transient transport layer for objects or 
relational data, but I am not one of them.

Still, data types are an extremely general and useful characteristic of 
most of the structured, significant parts of truly semi-structured data. 
They are also important for XML views of structured data in query systems.

Most systems that allow data declarations allow both structure and type to 
be declared. That's true of both programming languages and databases. 
Almost all of the post-DTD schema languages that have been proposed allow 
simple data types, and makers of native XML databases felt the need to add 
data types even before W3C XML Schema existed. If I don't know the data 
type, I don't know how to build an index on an element, compare two 
elements of the same type, or do many other kinds of basic processing.
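
To illustrate the comparison point, here is a minimal Python sketch. The sample values and the choice of a decimal type are assumptions made for illustration; they do not come from any particular schema or product.

```python
from decimal import Decimal

# Hypothetical element contents, e.g. the text of several <children> elements.
values = ["10", "9", "3.0", "2"]

# With no declared type, the only safe ordering is on the raw strings.
as_strings = sorted(values)

# With a declared numeric type (a decimal type is assumed here), the
# values are ordered by what they mean, not by how they are spelled.
as_decimals = sorted(values, key=Decimal)

print(as_strings)   # ['10', '2', '3.0', '9']
print(as_decimals)  # ['2', '3.0', '9', '10']
```

An index built over the string ordering would answer a range query such as "fewer than 5" incorrectly, which is why the declared type matters before any indexing or comparison takes place.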

>First, a schema that handled your example data in a truly useful way
>would be non-trivial at best (or some non-trivial code would be needed
>to preprocess data to meet it).
>
>Think of instances such as
><person>
><ssn>123-456-789</ssn>
><name>THX-1135</name>
><children>3.0</children>
></person>
>
><person>
><ssn>123 456    789</ssn>
><name>[none of your business]</name>
><children>three</children>
></person>

Right - in fact, simple data types capture a small, but important, part of 
the semantics - enough to know how to build an index, determine a valid 
range, etc. Again, this is no different from SQL, Java, C++, and the like, 
which require data scrubbing in addition to data types. Application 
semantics are outside the scope of data types.
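
The split between data scrubbing and data typing can be sketched in Python roughly as follows. The function names and the normalization rules are hypothetical, chosen only to match the quoted <person> instances.

```python
import re

def scrub_ssn(raw):
    """Scrubbing step: collapse runs of spaces or hyphens to one hyphen."""
    parts = re.split(r"[\s-]+", raw.strip())
    return "-".join(parts)

def children_as_int(raw):
    """Typing step: return an int if the text denotes a whole number, else None."""
    try:
        value = float(raw)
    except ValueError:
        return None  # e.g. "three" is not a numeric lexical form
    return int(value) if value.is_integer() else None

print(scrub_ssn("123 456    789"))  # 123-456-789
print(children_as_int("3.0"))       # 3
print(children_as_int("three"))     # None
```

The type check only says whether a value is a usable number. Deciding that "three" ought to mean 3, or that the scrubbed ssn is a plausible identifier, remains application logic.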

In object-oriented applications, databases, and other rigidly structured 
domains, we have found that complex data structures built from simple typed 
data, with names and types for each item, provide a level of information 
that is generally useful. It is also the level of information needed for generic 
programming. With this information, I know how to compare or sort data, 
whether it is legitimate to perform certain operations on it, and how to 
map it into roughly equivalent representations in SQL, Java, C++, or 
whatever. This is important when writing middleware, report writers, or 
other tools that provide application-neutral handling of data, or when 
persisting data and creating indexes for it.
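
As a rough sketch of that kind of application-neutral handling, the Python below drives both sorting and a SQL column mapping from a declared type name alone. The type names follow W3C XML Schema; the table itself, the SQL type names, and the helper functions are assumptions for illustration.

```python
from datetime import date
from decimal import Decimal

# Hypothetical mapping: declared simple type -> (Python parser, SQL column type).
# date.fromisoformat handles plain YYYY-MM-DD only; time zone suffixes
# would need extra handling that this sketch omits.
TYPE_MAP = {
    "xs:string": (str, "VARCHAR"),
    "xs:integer": (int, "INTEGER"),
    "xs:decimal": (Decimal, "DECIMAL"),
    "xs:date": (date.fromisoformat, "DATE"),
}

def sql_column(name, xsd_type):
    """Produce a column definition fragment for a named, typed item."""
    return f"{name} {TYPE_MAP[xsd_type][1]}"

def sort_values(raw_values, xsd_type):
    """Sort raw text values according to their declared type."""
    parse = TYPE_MAP[xsd_type][0]
    return sorted(raw_values, key=parse)

print(sql_column("children", "xs:integer"))
print(sort_values(["2002-09-30", "2002-09-28"], "xs:date"))
```

Nothing here knows what a person or a customer is. The name and the declared type of each item are enough to sort it and to map it to a column.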

>Second, think of data that simply can't be validated by syntax.  For example:
>
><prime-number-public-key>120349812304897210349876786238746</prime-number-public-key>
><customer-id>666-1313-0000</customer-id>
>
>Ain't no way a schema validator is going to enforce the contract that those be
>valid prime numbers, customer-ids., etc.

In fact, if you make your data types complex enough to handle this sort of 
thing, I think you will have made them too complex to be generally useful.
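
A small Python sketch of the boundary, using the quoted key value. The digits-only pattern stands in for what a simple type can express and is an assumption, not a rule from any schema.

```python
import re

candidate = "120349812304897210349876786238746"

# Lexical check of the kind a simple data type can express: digits only.
lexically_valid = re.fullmatch(r"[0-9]+", candidate) is not None

# Application-level check that a simple type does not express:
# this value is even, so it cannot be prime.
is_even = int(candidate) % 2 == 0

print(lexically_valid, is_even)  # True True
```

The value passes as an integer and still breaks the contract, so the primality test belongs in application code rather than in the type system.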

>The complaint, basically, is that a vastly disproportionate amount of the
>W3C's effort has been spent moving from what would be an "80%" solution
>(roughly what one can do with RELAX NG, perhaps) to a "90%" solution (maybe
>95% ... let's not quibble ... it's very significantly under 100%).  This
>relatively small increase in the actual practical effectiveness of the
>strongly-typed approach over a more weakly-typed approach does not justify,
>in the opinion of many who post here, the immense amount of complexity it
>has added to WXS and XQuery, the difficulty that has caused implementers
>and end users, not to mention the years added to the time it takes to get
>the specs to Recommendation status.
>
>So, few would disagree that "it's in the contract".  Lots would disagree that
>the amount of effort/complexity added to XML++ to validate the "contract" with
>schema-based mechanisms is worth the cost.

I don't think simple data types are much of the cost or complexity. It's 
easy to argue that we don't need as many simple types as W3C XML Schema, 
and some of the date/time types are a real headache - in particular, trying 
to specify how time zones are handled in all applications may be causing 
much more complexity than it is worth. Does anyone remember which 
mathematician said something along the lines of, "God created the integer, 
all the rest is the work of mankind"? The most basic data types probably 
give the most bang for the buck.

The complexity of W3C XML Schema's type hierarchies - distinguishing simple 
from complex types by placing them in separate hierarchies, distinguishing 
complex types from element types and placing them in separate hierarchies - 
has caused much more confusion and complexity than the simple types. 
There are just too many hierarchies. The notion that a single document can 
have validated regions and regions that are not validated has caused 
similar confusion. The inheritance model is complex, and identity 
constraints are complex. As you know, RELAX NG with simple data types is 
much simpler than W3C XML Schema, and the people who designed RELAX NG also 
seem to believe that support for simple data types is important.

Jonathan
