XML.org
Re: [xml-dev] The String Datatype is the Worst Datatype Ever Created

Let’s also not forget that even where strings are constrained, we don’t always have the luxury of using a type system that allows us to describe the precise constraints.

For example, not many type systems (certainly not XSD) allow one to say “This string must be a valid XPath expression” or “This string must be a valid Java program”.

Michael Kay
Saxonica


On 23 Sep 2015, at 12:27, Costello, Roger L. <costello@mitre.org> wrote:

Hi Folks,

Highlights

String elements are cesspools of garbage and hacker exploits.

Use enumerations.

Enumerations are symbols. Symbolic processing is the key to success.

Scope of discussion

The following discussion applies only to data that is to be processed exclusively by machines (i.e., machine-to-machine processing). It does not apply to data that is to be processed by humans.

Lately I have been studying the IP protocol

IP has a header with numerous fields. Some of the fields have numeric values: Version, Header Length, Fragment Offset, etc. Some of the fields are "text" fields (I quote the word "text" because IP is actually a binary format; nonetheless, these "text" fields denote symbols with well-defined semantics): Type of Service, Flags, Protocol, etc. What is significant about these text fields is that their allowable values are enumerated:

Type of Service: Normal Delay, Low Delay, Normal Throughput, High Throughput, Normal Reliability, High Reliability

Flags: May Fragment, Don't Fragment, Last Fragment, More Fragments

Protocol: TCP, UDP

I've also been studying the TCP protocol

TCP also has a header with numerous fields. Some of the fields are numeric, some are text. The text fields have enumerated values, e.g.

Control Field: URG (urgent), ACK (acknowledgement), PSH (push), RST (reset connection), SYN (synchronize), FIN (finish)

What do these data formats (protocols) have in common?

Answer: they don't allow text fields to contain arbitrary (unspecified) strings. The allowable values are enumerated and clearly defined.

That makes sense, right? After all, how would machines (routers, gateways) make routing decisions on arbitrary strings? Answer: they can't.

Likewise, machines cannot process XML documents that contain arbitrary strings. Don't use the string datatype in XML Schemas. Ever.
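As a minimal sketch of what that looks like in XSD, the Protocol field listed above could be declared as an enumeration. The names ProtocolType and Protocol are hypothetical, chosen only for illustration. Note that XSD enumerations are still written as a restriction of a built-in type (xs:token here), but the value space is reduced to the listed symbols:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: only the two symbols named above are valid -->
  <xs:simpleType name="ProtocolType">
    <xs:restriction base="xs:token">
      <xs:enumeration value="TCP"/>
      <xs:enumeration value="UDP"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical element that uses the enumerated type -->
  <xs:element name="Protocol" type="ProtocolType"/>

</xs:schema>
```

With a declaration like this, a validating parser would accept <Protocol>TCP</Protocol> and reject any value that is not in the list.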

You might argue …

But Roger, your favorite example is a Book:

<Book>
    <Title>Illusions The Adventures of a Reluctant Messiah</Title>
    <Author>Richard Bach</Author>
    <Date>1977</Date>
    <ISBN>0-440-34319-4</ISBN>
    <Publisher>Dell Publishing Co.</Publisher>
</Book>

How do you intend to enumerate all the authors in the world? All the book titles in the world? All the publishers in the world?

Answer: I will remove those elements. Author, Title, and Publisher have no business being in an XML document that is to be processed by machines. If you can't enumerate it, don't include it. In this example, ISBN is sufficient to identify the book. None of the other fields are needed. The Title, Author, and Publisher (string) elements are simply cesspools for garbage and hacker exploits.
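Applied to the Book example, that rule would leave an instance along these lines. This is only a sketch: the ISBN value is the one from the example above, and whether Date also survives is not settled here, so it is left out:

```xml
<!-- Book reduced to the one field said to be sufficient to identify it -->
<Book>
    <ISBN>0-440-34319-4</ISBN>
</Book>
```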

What about the cost to update enumeration lists?

The usual objection: modifying a schema to include a new enumeration value is expensive, and that's why we use strings. I'm not buying that argument. So what if a string datatype enables your XML instances to carry new data? If the machines haven't been updated to understand the new data, you have achieved nothing.

Symbolic Processing

An enumeration is a symbol. When you use enumerations, you are in the realm of symbolic processing. A couple months ago I attended a talk by Stephen Wolfram and he said a key to his company's success is symbolic processing. I believe this is what he was referring to.

Are constrained strings okay?

No. Suppose you set maxLength to 5 but don't constrain the character set. The number of possible strings of up to 5 characters over the entire Unicode character set is astronomical. There's no way you are going to be able to specify the semantics of each one. Stick with enumerations.
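For reference, the kind of constrained string being rejected here might be declared as follows. This is a sketch only, and the type name ShortCodeType is hypothetical:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: length is capped, but any characters are allowed -->
  <xs:simpleType name="ShortCodeType">
    <xs:restriction base="xs:string">
      <xs:maxLength value="5"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

The maxLength facet bounds the size of the value but says nothing about which values are meaningful, which is the gap an enumeration closes.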

Comments?

 

/Roger

 




Copyright 1993-2007 XML.org. This site is hosted by OASIS