From: "Gustaf Liljegren" <firstname.lastname@example.org>
> Ever since XML Schema started to evolve and the talk about datatypes in XML
> took off, I've been wondering secretly why XML validation needs the concept
> of datatypes at all. XML is a plain text format, so content validation in
> XML should be no different from regular pattern matching. Or why should it?
Well, for a start "XML" has no needs, not being a person! It is users who have
needs, so the focus of any question has to be users.
Some users do not need datatypes. For example, people who are just sending
strings in graphs to each other in their documents. Publishing tends to this
extreme. Or people who are confident that their data values are valid
(because they were validated at data capture and the recipient trusts the sender).
Or people who have such high transaction rates they cannot really afford any
more checks, or who have checks already built in by subsequent stages.
But many other users do want datatyping, because they want to perform
QA on outgoing data or QC on incoming data. For editing, datatype
checking can allow friendlier messages so that problems can be
fixed at source rather than requiring technical personnel in the middle
of the chain (or worse, at the far end) to make the data right.
Programmers will be aware of the big trend towards
programming-by-contract (in Bertrand Meyer's nice terms) which has seen
assertions added to Java 1.4: datatyping (and validation languages in
general) is useful for making invariants explicit, and for unit testing.
Datatypes have a use beyond validation: if the datatype
aligns with a "storage type" (e.g. it constrains a number's value
space to whole numbers 0-255, which will fit into a byte) it can
be used to drive interface builders: for example to make a schema-specific
DOM that stores data very efficiently.
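A minimal Python sketch of the storage-aligned idea, assuming a type whose value space is the whole numbers 0-255. The function name `pack_unsigned_bytes` and the sample values are made up for illustration; the point is only that once the constraint is known, each value can be held in a single byte.

```python
import re
from array import array

WHOLE_NUMBER = re.compile(r"[0-9]+")  # lexical check: ASCII digits only


def pack_unsigned_bytes(lexical_values):
    """Check each string against the value space 0-255, then store one byte each."""
    packed = array("B")  # typecode "B": unsigned char, one byte per item
    for text in lexical_values:
        token = text.strip()
        if not WHOLE_NUMBER.fullmatch(token):
            raise ValueError(f"not a whole number: {text!r}")
        value = int(token)
        if value > 255:
            raise ValueError(f"outside 0-255: {text!r}")
        packed.append(value)
    return packed


packed = pack_unsigned_bytes(["0", "17", "255"])
print(list(packed), packed.itemsize)
```

A value such as "256" or "-1" is rejected before it ever reaches the compact store, so validation and storage layout come from the same constraint.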
If the datatype expresses its semantics (e.g. "this is a date") then
it allows conversion between different lexical forms (e.g. US Gregorian
date to Australian Gregorian date) and translation between different
value spaces (e.g. between the Gregorian calendar and the Islamic
calendar, assuming for the point of argument that they are different).
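To make the lexical-form conversion concrete, here is a small Python sketch. It assumes the US form is month/day/year and the Australian form is day/month/year; the names `US_FORM`, `AU_FORM` and `convert` are illustrative only. Translation between calendars (value spaces) is not attempted here.

```python
from datetime import datetime

US_FORM = "%m/%d/%Y"  # assumed US lexical form: month/day/year
AU_FORM = "%d/%m/%Y"  # assumed Australian lexical form: day/month/year


def convert(text, source_form, target_form):
    """Parse text in one lexical form and re-serialize the same date in another."""
    value = datetime.strptime(text, source_form).date()  # lexical form -> value
    return value.strftime(target_form)                   # value -> lexical form


print(convert("12/31/2001", US_FORM, AU_FORM))
```

The conversion is only possible because the software knows the string is a date; a plain pattern match on digits and slashes would not tell it which field is the month.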
So you can see that there are actually categories of datatyping:
* value constraining
* storage aligned
* semantic
and that there is no universal agreement (or reason to expect or
want one) on which is better or best or appropriate or wrong.
Even the issue of "should these be separate layers or should
these be mixed?" has no consensus.
For example, in the W3C XML Schema specs we find strings (value
constraining), bytes (storage aligned) and dates (semantic).
But at W3C we also find RDF Schema, which is much more
concerned with (a framework for) semantics.
Another aspect of datatyping is whether to express it
declaratively or functionally: do you say "this is positiveNumber"
or "this is a number > 0"? In the first case, which is more
declarative IYKWIM, a system can easily figure out
the value constraints, the storage alignment (and perhaps
the semantics.) You can use Schematron for lots
of datatyping, but it is functional not declarative in
that sense: one of the reasons for XML Schemas
building in so many derived simple types is to make
it easier to figure out the storage alignment and semantics.
(The proof of the pudding is always in the eating, of course.)
So Schematron datatyping is good for validation but not
much use for figuring out efficient storage structures
(of course, this was not a goal!)
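The contrast between the two styles can be sketched in Python. Everything here is hypothetical: the `DECLARED_TYPES` table, its facet names, and the restriction to whole numbers are assumptions made for the example, not anything taken from XML Schemas or Schematron. The declared type is data that a tool can inspect; the functional rule is an opaque test.

```python
# Declarative style: a type is a name with facets that tools can read.
DECLARED_TYPES = {
    "positiveNumber": {"min": 1, "max": None},  # whole numbers > 0, unbounded above
    "unsignedByte": {"min": 0, "max": 255},
}


def validate_declared(type_name, value):
    """Validate a whole number against the declared facets of a named type."""
    facets = DECLARED_TYPES[type_name]
    if facets["min"] is not None and value < facets["min"]:
        return False
    if facets["max"] is not None and value > facets["max"]:
        return False
    return True


def storage_bytes(type_name):
    """Work out a storage width from the facets; None if no fixed width fits."""
    facets = DECLARED_TYPES[type_name]
    low, high = facets["min"], facets["max"]
    if low is None or high is None or low < 0:
        return None
    return max(1, (high.bit_length() + 7) // 8)


# Functional style: the rule is just a test; it validates equally well,
# but a tool cannot look inside it to derive a storage layout.
def positive_number_rule(value):
    return value > 0


print(validate_declared("positiveNumber", 3), positive_number_rule(3))
print(storage_bytes("unsignedByte"), storage_bytes("positiveNumber"))
```

Both styles accept and reject the same values here; only the declared form lets `storage_bytes` answer a question that was never written as a rule.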
It is tempting to conclude from the above that "some
people need less datatyping, some people need more;
some people need just lexical typing, some people need
value typing, some people need storage or semantic
datatyping". That is true as far as it goes, but it
hides two essential points, which are at the heart of
the datatyping problem. Your answer to these will
largely determine many technical choices you make:
1) Should datatyping be proscriptive or descriptive?
2) Is there structure inside data values?
PROSCRIPTIVE or DESCRIPTIVE
The proscriptive approach is exemplified by XML Schemas
(though tempered for practicality by its derivation facilities).
It says "you can only use one lexical form, and we supply
a comprehensive list of built-ins; anything outside that
you simulate using regex checking and providing your
own validators". People who favour the proscriptive
approach tend to feel that users are always shielded
from actual XML values by user interfaces, so in
a sense a lot of the value comes from everyone standardizing
on the same set of types rather than from the completeness
of the types themselves.
The descriptive approach says "I have data in a particular
preferred lexical form, and I want markup to describe it."
In this view, the user may edit the XML as text or only have
thin interfaces where the user types the value directly.
The documents may well be stored in text files where
there is no mediating infrastructure to perform conversions.
For example, "You want to send in your American documents
with US-style dates and I want to send my Australian documents
with Australian-style dates, and we want the receiving system to
validate them both as dates, and allow mixtures and comparison."
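A rough Python sketch of what the descriptive approach asks of a receiving system. The `date` element, its `form` attribute and the labels "US" and "AU" are invented for this example; the assumption is only that the markup says which lexical form each value uses, so that every value can be checked as a date and mapped to a common value for comparison.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical labels for the lexical forms a sender may declare.
LEXICAL_FORMS = {"US": "%m/%d/%Y", "AU": "%d/%m/%Y"}


def date_values(xml_text):
    """Return the date values in a document, whatever lexical form each one uses."""
    values = []
    for element in ET.fromstring(xml_text).iter("date"):
        pattern = LEXICAL_FORMS[element.get("form")]
        # strptime raises ValueError if the text is not a valid date in that form
        values.append(datetime.strptime(element.text.strip(), pattern).date())
    return values


american = '<doc><date form="US">12/31/2001</date></doc>'
australian = '<doc><date form="AU">31/12/2001</date></doc>'
mixed = ('<doc><date form="US">01/02/2002</date>'
         '<date form="AU">01/02/2002</date></doc>')

print(date_values(american) == date_values(australian))
first, second = date_values(mixed)
print(first < second)
```

In the mixed document the two strings are identical, yet they denote different days (2 January and 1 February), which is exactly why the lexical form has to be described rather than assumed.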
STRUCTURE or NO STRUCTURE
The no-structure approach is exemplified by XML Schemas
(though tempered for practicality by lists and unions). The view
can be characterized as "We only need to worry about
explicit XML structure." In other words, only elements are
of interest for validation.
The non-XML structure approach says that the idea that there
is only element structure containing atomic types flies in the
face of how people actually use (and want to use) XML.
It is a 3rd normal form assumption that can be refuted merely
by looking at almost any real DTD (not being a DTD used
for data transfer to or from a DBMS): for example, XHTML,
SVG, XSL-FO, etc.
In this view, the XML Schemas division between simple
types and complex types is weak: there is a missing
level of non-XML structures which XML Schemas will either
model badly (as strings) or model as if they are simple types
(such as gDates, and therefore get into trouble) or not at
all (such as measure="1cm 2inch 3 em").
In this view, there are actually probably very few primitive
datatypes [numbers, boolean, string, symbol?] but a variety
of tokenizing rules (space separated, Unicode block separated,
COBOL-style pictures, punctuation-separated, etc.). This
is an area of interest to me: it would be interesting to analyse
the attribute values in a spectrum of publishing/scientific
languages (SVG, XHTML, XSL, etc.) to see if there
are, in fact, a great variety of tokenizing types or if there
are only a handful of parameterizable types (i.e. to avoid
going to regular expressions or parsers).
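One way to picture a parameterizable type is a single tokenizer that takes a separator and a primitive as parameters. The Python sketch below is an assumption about what such a thing might look like, not a proposal from any spec; the `PRIMITIVES` table and the `tokenize` function are hypothetical names, and only separator-based rules are covered.

```python
# A handful of primitive datatypes, each a converter from token text to a value.
PRIMITIVES = {
    "number": float,
    "string": str,
    "boolean": lambda token: {"true": True, "false": False}[token],
}


def tokenize(value, separator=None, primitive="string"):
    """Split value on separator (None means runs of whitespace), convert each token."""
    convert = PRIMITIVES[primitive]
    return [convert(token.strip()) for token in value.split(separator)]


print(tokenize("10 20  30", primitive="number"))   # space separated
print(tokenize("2001-12-31", "-", "number"))       # punctuation separated
print(tokenize("true,false", ",", "boolean"))
```

A value like measure="1cm 2inch 3 em" would need one more parameter (how a number is split from its unit), which is the kind of question a survey of real attribute values could settle.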
(The presence of non-tag structure is actually built into ISO SGML:
you can, inside an element, declare a map which recognizes
certain strings as delimiters that introduce or separate structures.
So this is not some fancy wishful thinking, but something that
XML gave up for parsing simplicity. I am not trying to
reintroduce SHORTREF into XML! But there is data that has
structure that we want to validate but not split into different
information units: dates and URLs are good examples.)
I hope this is of some use.