Re: [xml-dev] The String Datatype is the Worst Datatype Ever Created

You will not be able to interchange textual data that comes from a database, then, because (practically by definition) databases contain variable data, and usually not only enumerated data.

I wonder what kind of XML files you envision that don't contain data from databases. For computer to computer processing, it would seem that you could only send pre-arranged command signals. Not much need for XML for that.

Even the IP packets you brought up as an exemplar are used to transmit varying textual data. It's only for the control slots that enumerations are used, not in the contents that the packets carry.

TomP

On 9/23/2015 7:27 AM, Costello, Roger L. wrote:
Hi Folks,


  Highlights

String elements are cesspools of garbage and hacker exploits.

Use enumerations.

Enumerations are symbols. Symbolic processing is the key to success.


  Scope of discussion

The following discussion applies only to data that is to be processed
exclusively by machines (i.e., machine-to-machine processing). It does
not apply to data that is to be processed by humans.


  Lately I have been studying the IP protocol

IP has a header with numerous fields. Some of the fields have numeric
values: Version, Header Length, Fragment Offset, etc. Some of the fields
are "text" fields (I quote the word "text" because IP is actually a
binary format; nonetheless, these "text" fields denote symbols with
well-defined semantics): Type of Service, Flags, Protocol, etc. What is
significant about these text fields is that their allowable values are
enumerated:

Type of Service: Normal Delay, Low Delay, Normal Throughput, High
Throughput, Normal Reliability, High Reliability

Flags: May Fragment, Don't Fragment, Last Fragment, More Fragments

Protocol: TCP, UDP
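
A minimal Python sketch of what "enumerated" looks like for two of these fields. The protocol numbers 6 (TCP) and 17 (UDP) and the flag bit positions follow the IPv4 header layout; the function name `decode_ipv4_fields` and the sample header bytes are made up for illustration, and only the two protocols listed above are modelled.

```python
import struct
from enum import IntEnum, IntFlag


class Protocol(IntEnum):
    """The two Protocol values named in the post (many more are assigned)."""
    TCP = 6
    UDP = 17


class Flags(IntFlag):
    """The defined bits of the 3-bit IPv4 Flags field (top bit is reserved)."""
    MORE_FRAGMENTS = 0b001
    DONT_FRAGMENT = 0b010


def decode_ipv4_fields(header: bytes) -> tuple[Protocol, Flags]:
    """Decode Protocol and Flags from the first 20 bytes of an IPv4 header."""
    if len(header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes")
    # Bytes 6-7 hold Flags (3 bits) + Fragment Offset (13 bits); byte 9 is Protocol.
    (flags_and_offset,) = struct.unpack_from("!H", header, 6)
    flags = Flags((flags_and_offset >> 13) & 0b011)
    # Protocol(...) raises ValueError for any number not listed in the enum.
    return Protocol(header[9]), flags


# Hypothetical header: version 4, IHL 5, Don't Fragment set, protocol 6 (TCP).
sample = bytes([0x45, 0x00, 0x00, 0x28, 0x00, 0x01, 0x40, 0x00,
                0x40, 0x06, 0x00, 0x00, 10, 0, 0, 1, 10, 0, 0, 2])
protocol, flags = decode_ipv4_fields(sample)
```

In this sketch a value outside the enumeration is rejected at decode time rather than passed along as an opaque number.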


  I've also been studying the TCP protocol

TCP also has a header with numerous fields. Some of the fields are
numeric, some are text. The text fields have enumerated values, e.g.

Control Field: URG (urgent), ACK (acknowledgement), PSH (push), RST
(reset connection), SYN (synchronize), FIN (finish)
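
As a rough illustration, the six control bits can be modelled in Python as a flag set, since a segment may carry several of them at once. The bit values are the standard TCP header positions; the names `TcpControl` and `decode_control_bits` are invented for this sketch.

```python
from enum import IntFlag


class TcpControl(IntFlag):
    """The six control bits named in the post, at their TCP header bit values."""
    FIN = 0x01
    SYN = 0x02
    RST = 0x04
    PSH = 0x08
    ACK = 0x10
    URG = 0x20


def decode_control_bits(flags_byte: int) -> TcpControl:
    """Keep only the six classic control bits and return them as a flag set."""
    return TcpControl(flags_byte & 0x3F)


# 0x12 is the bit pattern of a SYN+ACK segment.
syn_ack = decode_control_bits(0x12)
is_handshake_reply = TcpControl.SYN in syn_ack and TcpControl.ACK in syn_ack
```

Every bit has a defined meaning, so a receiver can branch on membership tests like the last line instead of interpreting free text.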


  What do these data formats (protocols) have in common?

Answer: they don't allow text fields to contain arbitrary (unspecified)
strings. The allowable values are enumerated and clearly defined.

That makes sense, right? After all, how would machines (routers,
gateways) make routing decisions on arbitrary strings? Answer: they can't.

Likewise, machines cannot process XML documents that contain arbitrary
strings. Don't use the string datatype in XML Schemas. Ever.
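
One way to picture the receiving side, sketched in Python: the machine accepts only values from a fixed set and rejects everything else. In XML Schema the set would be written as `xs:enumeration` facets; Python's standard library does not validate schemas, so the check is done by hand here. The `Packet` and `Protocol` element names, the allowed values, and `read_protocol` are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical enumeration; in a schema these would be xs:enumeration facets.
ALLOWED_PROTOCOLS = frozenset({"TCP", "UDP"})


def read_protocol(xml_text: str) -> str:
    """Return the Protocol value, rejecting anything outside the enumeration."""
    root = ET.fromstring(xml_text)
    value = (root.findtext("Protocol") or "").strip()
    if value not in ALLOWED_PROTOCOLS:
        raise ValueError(f"unexpected Protocol value: {value!r}")
    return value


accepted = read_protocol("<Packet><Protocol>UDP</Protocol></Packet>")
```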


  You might argue …

/But Roger, your favorite example is a Book:/

<Book>
<Title>Illusions The Adventures of a Reluctant Messiah</Title>
<Author>Richard Bach</Author>
<Date>1977</Date>
<ISBN>0-440-34319-4</ISBN>
<Publisher>Dell Publishing Co.</Publisher>
</Book>

/How do you intend to enumerate all the authors in the world? All the
book titles in the world? All the publishers in the world?/

Answer: I will remove those elements. Author, Title, and Publisher have
no business being in an XML document that is to be processed by
machines. If you can't enumerate it, don't include it. In this example,
ISBN is sufficient to identify the book. None of the other fields are
needed. The Title, Author, and Publisher (string) elements are simply
cesspools for garbage and hacker exploits.


  What about the cost to update enumeration lists?

/Modifying a schema to include a new enumeration value is expensive,
that's why we use strings/. I'm not buying that argument. So what if a
string datatype enables your XML instances to carry new data? If the
machines haven't been updated to understand the new data, you have
achieved nothing.


  Symbolic Processing

An enumeration is a symbol. When you use enumerations, you are in the
realm of symbolic processing. A couple of months ago I attended a talk by
Stephen Wolfram, and he said a key to his company's success is symbolic
processing. I believe this is what he was referring to.


  Are constrained strings okay?

No. Suppose you set maxLength to 5 but don't constrain the character
set. The number of possible strings of up to 5 characters drawn from the
entire Unicode character set is astronomical. There's no way you are
going to be able to specify the semantics of each one. Stick with
enumerations.
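
To put a number on "astronomical", here is a small Python sketch that counts the strings of length 0 to 5. It assumes the alphabet is every Unicode scalar value (all code points except the surrogates); XML excludes a few more characters, so treat the result as an upper bound. The function name `count_strings` is made up for this example.

```python
# Unicode has 0x110000 code points; 2048 of them are surrogates,
# which cannot appear as characters on their own.
UNICODE_SCALAR_VALUES = 0x110000 - 2048


def count_strings(max_length: int, alphabet_size: int = UNICODE_SCALAR_VALUES) -> int:
    """Count distinct strings of length 0..max_length over the alphabet."""
    return sum(alphabet_size ** n for n in range(max_length + 1))


total = count_strings(5)
```

Under these assumptions `total` comes out at roughly 1.7 × 10^30, far too many values to assign a meaning to one by one.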


  Comments?

/Roger








