At 04:28 PM 9/28/2002 -0400, Mike Champion wrote:
>9/28/2002 7:50:50 AM, Jonathan Robie
> >If the applications that use this data require data of the appropriate
> >type, and we want validation to be able to determine whether the contract
> >is being followed, then we have to allow data types to be declared.
>I think there are a number of problems with taking such types all that
>seriously for real *XML*-centric applications, even accepting the
>(pretty reasonable!) argument that the schema should define a contract
>between producers and consumers of data. (I wouldn't quarrel with using
>types extensively in OO programming languages, nor in exploiting SQL types
>in SQL-centric programs; I simply think that XML has other use cases and
>design patterns than these technologies. Disagree? That's another thread!)
I certainly agree with you - after all, I led the development of an
SGML/XML repository for editorial environments in 1995-1996, have spoken at
two conferences on querying ancient Greek manuscripts, and participated in
the first demo for HL7 patient records in XML. I know there are people who
think that XML is only a transient transport layer for objects or
relational data, but I am not one of them.
Still, data types are an extremely general and useful characteristic of
most structured data and of significant parts of true semi-structured data.
They are also important for XML views of structured data in query systems.
Most systems that allow data declarations allow both structure and type to
be declared. That's true of both programming languages and databases.
Almost all of the post-DTD schema languages that have been proposed allow
simple data types, and makers of native XML databases felt the need to add
data types even before W3C XML Schema existed. If I don't know the data
type, I don't know how to build an index on an element, compare two
elements of the same type, or do many other kinds of basic processing.
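
To illustrate the comparison problem, here is a minimal Python sketch. The
<qty> element name and its values are made up for the example; the point is
only that the same element content sorts differently depending on whether it
is treated as untyped text or as integers.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; <qty> is an invented element name.
doc = ET.fromstring(
    "<items><qty>9</qty><qty>10</qty><qty>100</qty><qty>25</qty></items>"
)
values = [e.text for e in doc.findall("qty")]

# Without a declared type, all we can do is compare the text as strings.
as_strings = sorted(values)

# With a declared integer type, comparison follows numeric order.
as_integers = sorted(values, key=int)

print(as_strings)   # ['10', '100', '25', '9']
print(as_integers)  # ['9', '10', '25', '100']
```

An index built over the string order would answer a range query such as
"qty > 20" incorrectly, which is why the type has to be known first.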
>First, a schema that handled your example data in a truly useful way
>would be non-trivial at best (or some non-trivial code would be needed
>to preprocess data to meet it).
>Think of instances such as
><ssn>123 456 789</ssn>
><name>[none of your business]</name>
Right - in fact, simple data types capture a small, but important, part of
the semantics - enough to know how to build an index, determine a valid
range, etc. Again, this is no different from SQL, Java, C++, and the like,
which require data scrubbing in addition to data types. Application
semantics are outside the scope of data types.
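
A rough Python sketch of that division of labor, using the two instances
quoted above. The function names and the choice of "integer" as the declared
type are assumptions made for illustration, not anything defined by a schema
language: the type check only tests the lexical form, and scrubbing is a
separate, application-level step.

```python
import re

def is_integer_lexical(text):
    """Type-level check: does the text look like an unsigned integer?"""
    return re.fullmatch(r"[0-9]+", text) is not None

def scrub(text):
    """Application-level scrubbing (hypothetical rule): drop all whitespace."""
    return "".join(text.split())

ssn = "123 456 789"
name = "[none of your business]"

raw_ok = is_integer_lexical(ssn)              # False: spaces break the lexical form
scrubbed_ok = is_integer_lexical(scrub(ssn))  # True once the data is scrubbed
name_ok = is_integer_lexical(scrub(name))     # False: scrubbing cannot fix this one
```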
In object-oriented applications, databases, and other rigidly structured
domains, we have found that complex data structures built from simple typed
data, with names and types for each item, provide a level of information
that is generally useful. It is also the level of information needed for
generic programming. With this information, I know how to compare or sort data,
whether it is legitimate to perform certain operations on it, and how to
map it into roughly equivalent representations in SQL, Java, C++, or
whatever. This is important when writing middleware, report writers, or
other tools that provide application-neutral handling of data, or when
persisting data and creating indexes for it.
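
As a sketch of what such application-neutral handling might look like, the
Python below maps a list of (name, type) pairs to a SQL table definition.
The type names, the SQL column types chosen for them, and the "orders"
structure are all assumptions for the example; a real tool would need a much
more careful mapping.

```python
# Hypothetical mapping from simple type names to SQL column types.
SQL_TYPES = {
    "integer": "INTEGER",
    "decimal": "DECIMAL(18, 4)",
    "date": "DATE",
    "string": "VARCHAR(255)",
}

def create_table(name, fields):
    """Build a CREATE TABLE statement from (field name, type name) pairs."""
    columns = ", ".join(f"{fname} {SQL_TYPES[ftype]}" for fname, ftype in fields)
    return f"CREATE TABLE {name} ({columns})"

# Invented structure: names and types for each item, nothing else.
order_fields = [("id", "integer"), ("placed", "date"), ("total", "decimal")]
ddl = create_table("orders", order_fields)
print(ddl)
```

The tool knows nothing about what an order means; names and simple types
are enough to produce a roughly equivalent representation.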
>Second, think of data that simply can't be validated by syntax. For example:
>Ain't no way a schema validator is going to enforce the contract that those be
>valid prime numbers, customer-ids., etc.
In fact, if you make your data types complex enough to handle this sort of
thing, I think you will have made them too complex to be generally useful.
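
One way to picture the split, as a minimal Python sketch: the simple type
answers only "is this an integer?", and a constraint like primality stays in
application code. The function names and result strings are invented for the
example.

```python
import re

def is_integer_lexical(text):
    """What a simple data type can check: the lexical form."""
    return re.fullmatch(r"[0-9]+", text) is not None

def is_prime(n):
    """Application-level rule, checked by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def check(text):
    if not is_integer_lexical(text):
        return "invalid type"
    if not is_prime(int(text)):
        return "valid integer, not prime"
    return "ok"

results = [check("abc"), check("91"), check("97")]
```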
>The complaint, basically, is that a vastly disproportionate amount of the
>has been spent moving from what would be an "80%" solution (roughly what
>do with RELAX NG, perhaps) to a "90%" solution (maybe 95% ... let's not
>it's very significantly under 100%). This relatively small increase in
>practical effectiveness of the strongly-typed approach over a more
>approach does not justify, in the opinion of many who post here, the immense
>amount of complexity it has added to WXS and XQuery, the difficulty that has
>caused implementers and end users, not to mention the years added to the
>takes to get the specs to Recommendation status.
>So, few would disagree that "it's in the contract". Lots would disagree that
>the amount of effort/complexity added to XML++ to validate the "contract" with
>schema-based mechanisms is worth the cost.
I don't think simple data types are much of the cost or complexity. It's
easy to argue that we don't need as many simple types as W3C XML Schema has,
and some of the date/time types are a real headache - in particular, trying
to specify how time zones are handled in all applications may be causing
much more complexity than it is worth. Does anyone remember which
mathematician said something along the lines of, "God created the integer,
all the rest is the work of mankind"? The most basic data types probably
give the most bang for the buck.
The complexity of W3C XML Schema's type hierarchies - distinguishing simple
from complex types by placing them in separate hierarchies, distinguishing
complex types from element types and placing them in separate hierarchies -
has caused much more confusion and complexity than the simple types.
There are just too many hierarchies. So has the notion that a single document
can have validated regions and regions that are not validated. The
inheritance model is complex, and identity constraints are complex. As you
know, RELAX NG with simple data types is much simpler than W3C XML Schema,
and the people who designed RELAX NG seem also to believe that support for
simple data types is important.