Re: [xml-dev] Some random noise on rational type systems for XML

To: John Cowan <cowan@mercury.ccil.org>
Subject: Re: [xml-dev] Some random noise on rational type systems for XML
From: Amelia A.Lewis <amyzing@talsever.com>
Date: Tue, 6 May 2003 23:06:19 -0400
Cc: xml-dev@lists.xml.org
In-reply-to: <20030507012838.GE3507@ccil.org>
Organization: The Mysthical World of Talsever!
References: <20030506182451.4f40d846.amyzing@talsever.com><20030507012838.GE3507@ccil.org>

On Tue, 6 May 2003 21:28:38 -0400
John Cowan <cowan@mercury.ccil.org> wrote:
> Amelia A. Lewis scripsit:
> 
> > I think one of the worst problems with W3C XML Schema's types is
> > that they do not represent a system.
> 
> Agreed.  Let's see if we can construct something better, especially
> since there is an open DSDL slot for such a thing.
> 
> > First principle: the XML ur-type is "string".  Everything in XML can
> > be represented as a string (MUST be representable as a string).  It
> > can therefore be manipulated as a string--truncated, concatenated,
> > case-transformed, etc.  
> 
> No, I have to disagree here.  Every datatype instance can be
> *represented* by a string, right enough.  That does not mean the
> instance of that type *is* a string.  I can be represented by a string
> too:  "John Cowan". That doesn't make me a string.

I'm going to let Joe English respond here ...
Joe English wrote:
> But the _XML_ ur-type is string.  From the application
> point of view, you might have dates, integers, IEEE double
> precision floating point numbers, et cetera, but as far
> as XML is concerned everything is a string.

That's terser than I know how to write without an editor.

> After all, every date and duration can be represented as a number.

Irrelevant.  XML doesn't store numbers.

> For that matter, every string can be represented as a number by some
> trick such as making each character a digit in base 2^20+2^16
> notation.

Irrelevant.  XML doesn't store numbers to base 2^20+2^16 (unless you
mean to suggest Unicode, in which case this is just another way of
saying that everything is a string, in XML).

> That doesn't make you say that dates are numbers or that strings are
> numbers. 

When I'm dealing with Java, Dates are long integers (signed 64-bit
two's-complement integers) measuring milliseconds since the epoch.  In
Perl, I can certainly treat a string as a number or as a bitfield.
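
To make the "dates are numbers, in some languages" point concrete, here
is a minimal sketch in Python rather than Java or Perl.  It takes the
timestamp of this message (converted to UTC) and reduces it to a count
of milliseconds since the Unix epoch, which is the same representation
java.util.Date uses internally.  The variable names are mine, chosen
for illustration only.

```python
from datetime import datetime, timedelta, timezone

# The Date: header of this message, 23:06:19 -0400, expressed in UTC.
sent = datetime(2003, 5, 7, 3, 6, 19, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Integer division by a one-millisecond timedelta avoids float rounding.
millis = (sent - epoch) // timedelta(milliseconds=1)

# The number carries the same information: the date can be rebuilt from it.
rebuilt = epoch + timedelta(milliseconds=millis)
```

The application sees an integer; the XML document would still carry the
characters "2003-05-06T23:06:19-04:00" or something like them.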

The point here is what I really badly want to label "the Bray ploy" as
applied to value types.  Tim is famous for the slogan "XML is syntax". 
Applying that to value types, XML values are strings.

> Nor are strings or numbers octet-sequences, either, although
> of course they have several well-known representations as such.
> Representation is a red herring.

There is a circumstance for which that is true: the type system permits
multiple roots.  If, and only if, the type system permits multiple
roots, then it is reasonable to restrict the notion of "string" to
linguistic elements (words in English or Russian, for example).

W3C XML Schema wants to be singly-rooted.  So there's an ur-type.  The
ur-type *is* a string, even though its *name* is "anySimpleType".  When
you munge something and don't know its schema type, so that all you can
do is munge it as an anySimpleType, then you munge it as a string of (a
subset of) Unicode characters.
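
A small sketch of what "munging as anySimpleType" amounts to in
practice, assuming a hypothetical helper that receives a value with no
schema type attached.  The only operations it can safely apply are
string operations, because nothing else is known about the value.

```python
def munge_untyped(lexical: str) -> dict:
    """Hypothetical helper: with no schema type known, treat the value as text."""
    return {
        "truncated": lexical[:4],        # truncation
        "upper": lexical.upper(),        # case transformation
        "doubled": lexical + lexical,    # concatenation
    }

# The value happens to look like a date, but nothing here knows that.
result = munge_untyped("2003-05-06")
```

Whether truncating a date to "2003" is *meaningful* is an application
question; that it is *possible* follows from the value being a string.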

> > boolean
> > binary [octet-stream]
> > number
> > date
> > duration
> 
> If you add string to this list as an equal, I think it's pretty
> winning.

What, for ever and ever?  I don't.  I think it's a starting point, but I
think it may be a complete red herring as well.

It might be far better to start off with a definition of how a
"primitive" type can be defined.  W3C XML Schema says: "Revise this spec
in whole."  Very stable, but given the warts with the current set, that
is not a good thing.  The main thing that the experience of W3C XML
Schema gives us is experience in misdefinition.  We SHOULD use that
experience.

How?

I would assert that W3C XML Schema is flawed because it is incomplete,
because it does not contain a mechanism that allows it to be completed,
and because it is so complex that it verges on incomprehensibility,
which acts as another barrier to making it more complete.

A replacement MUST be more comprehensible.  That requires a small set of
rules, or a small set of types, or both.  The rules SHOULD be related to
one another in such a fashion that they are generally easy to remember.

W3C XML Schema gives us examples of algorithms used to derive types
(which are called facets), all of which impose restrictions, with one
exception.  The exception is a low-power combinator, which creates a
list from string-tokens-without-whitespace plus whitespace separators. 
The algorithms for derivation are probably a good starting point; more
combinators (Ostap Bender?) are certainly possible, as is adding power
to the existing one (allow the separator character to be specified and
the power of the list type is immediately vastly strengthened).
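
As a rough illustration of the last point, here is a sketch of a list
combinator with a configurable separator.  This is not how any WXS
processor is implemented; make_list_type and its parameters are
hypothetical names.  With no separator given it splits on whitespace,
as the WXS list does; with one given, items may themselves contain
whitespace.

```python
import re
from typing import Callable, List, Optional

def make_list_type(item_ok: Callable[[str], bool],
                   separator: Optional[str] = None) -> Callable[[str], List[str]]:
    """Build a parser for a list type from an item validator.

    separator=None mimics the WXS list: items are split on whitespace.
    Any other separator is an assumed extension, not part of WXS.
    """
    def parse(lexical: str) -> List[str]:
        items = lexical.split() if separator is None else lexical.split(separator)
        for item in items:
            if not item_ok(item):
                raise ValueError(f"invalid list item: {item!r}")
        return items
    return parse

def is_integer(s: str) -> bool:
    return re.fullmatch(r"-?[0-9]+", s) is not None

# WXS-style list of integers, whitespace separated.
int_list = make_list_type(is_integer)
# Comma-separated list whose items may contain spaces.
city_list = make_list_type(lambda s: len(s) > 0, separator=",")

ints = int_list("1 2 -3")
cities = city_list("New York,Los Angeles")
```

The second list cannot be expressed with the whitespace-only
combinator, since "New York" would be split into two items.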

> > Hmm.  We're missing one.  Ah, that's it: QName.  Question: does XML
> > need a pointer type?  Which would, of course, be represented as a
> > string.  If so, it might include, for instance, QName, XPath
> > expressions, and URIs. Let's say that there's an abstract pointer,
> > maybe.
> 
> The difficulty is that QNames are really different from URIs, because
> their interpretation is extremely context-sensitive, and you can't
> tell just by looking at the representation of one whether it actually
> refers to anything or not.
> 
> QName is an irritating datatype, but if we have to have it, it needs
> to be a seventh equal partner.  IRIs, OTOH, really are a subtype of
> strings: their definition is purely syntactic.

Here's the quibble, though.  If you include QNames, should you not also
include XPath expressions, which are used for much the same purpose, and
which have the same context sensitivity?  Does that mean two datatypes? 
Or one base one, with a means of deriving QName and XPath expression
from that base one?  What if some working group manages to come up with
an XPointer equivalent that people can actually agree upon and use; is
that not also likely to have context sensitivity, and to have a clear
relation to the existing pointer-like types?

If we agree that we're imposing a type system from the outside, then the
mandate is to create one that is comprehensible and complete (or
comprehensible and completable).  "Comprehensible" here means
establishing a simple rule to indicate that a particular type is
influenced by the structure in which it is embedded.  In this case, it
is influenced by the current set of namespaces in scope; true for QName
and for XPath expressions.  The semantic, for this type, is that it
indicates the name of a node or rule for selecting a node-set; it is the
XML equivalent of a pointer.
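
The context sensitivity can be sketched as follows.  The function
resolve_qname is a hypothetical helper, and the handling of unprefixed
names (falling back to the default namespace) is an assumption: real
specifications differ on that point depending on where the QName
appears.  The "urn:example:" namespace names are made up for
illustration.

```python
from typing import Dict, Optional, Tuple

def resolve_qname(qname: str,
                  in_scope: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Resolve a lexical QName to (namespace name, local name).

    in_scope maps prefixes to namespace names; the key "" stands for
    the default namespace (an assumption of this sketch).
    """
    prefix, colon, local = qname.partition(":")
    if not colon:
        # No prefix: the whole string is the local name.
        return (in_scope.get(""), qname)
    if prefix not in in_scope:
        raise KeyError(f"prefix {prefix!r} is not bound in this context")
    return (in_scope[prefix], local)

# The same string means different things in different parts of a document.
here = {"p": "urn:example:orders"}
there = {"p": "urn:example:invoices"}

a = resolve_qname("p:item", here)
b = resolve_qname("p:item", there)
```

The string "p:item" alone validates nothing; only the string plus the
namespaces in scope yields a value, and XPath expressions with prefixed
names have the same dependency.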

You're bound to go on as you start: should that be an enumeration of
permitted types, or a way of specifying how to specify a primitive type?
It would be possible, for instance, to state that a type library may
include normative description, in which is contained the algorithm for
validating a primitive of a given type.  Subtypes then use standard,
well-known algorithms (rules) for restricting the content.  The
difference between a primitive and a derived type becomes: a primitive
type defines new rules, specified by algorithm; a derived type restricts
the value space of an existing primitive (or more than one primitive).
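
One way to picture that distinction, as a sketch only: a primitive is
registered together with its validation algorithm, while a derived type
is built by stacking restrictions on an existing type.  All names here
(PRIMITIVES, define_primitive, derive) are hypothetical, the "number"
pattern is deliberately simplistic, and the restrictions operate on the
lexical form rather than on a parsed value.

```python
import re
from typing import Callable, Dict

Validator = Callable[[str], bool]

# An open registry: new primitives may be contributed at any time.
PRIMITIVES: Dict[str, Validator] = {}

def define_primitive(name: str, algorithm: Validator) -> None:
    """A primitive type introduces a new rule, given as an algorithm."""
    PRIMITIVES[name] = algorithm

def derive(base: Validator, *restrictions: Validator) -> Validator:
    """A derived type only narrows what its base already accepts."""
    return lambda s: base(s) and all(r(s) for r in restrictions)

define_primitive(
    "number",
    lambda s: re.fullmatch(r"-?[0-9]+(\.[0-9]+)?", s) is not None,
)

# Derived types: no new algorithm, only restrictions on "number".
unsigned = derive(PRIMITIVES["number"], lambda s: not s.startswith("-"))
short_unsigned = derive(unsigned, lambda s: len(s) <= 3)
```

Because derive can only add conditions, anything accepted by a derived
type is also accepted by its base, which is the property that keeps the
hierarchy easy to reason about.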

I tend toward favoring anarchy and competition: define a set of rules
for producing primitive types, and solicit contributions.  Let the users
of the parsers, through their bug reports/feature requests, determine
the best-defined, most useful primitives.  That pattern, though, has its
drawbacks, including failure of contributions, and failure of interest
by users because insufficient types are available to start with.  It
*does* recognize the WXS experience, by rejecting the top-down
specification of primitive types, and by starting from a position in
which further primitives MAY be added (without running the gauntlet of a
working group that has a published and accepted recommendation out).

Amy!
-- 
Amelia A. Lewis                    amyzing {at} talsever.com
Boxing is a lot like ballet, except that they don't dance, there isn't
any music, and they hit each other.
