RE: The String Datatype is the Worst Datatype Ever Created

I see that Roger has started another interesting discussion.  Let me throw in a thought or two.

It is impossible to completely separate the data from the people.  People supply the meaning of all data, which is always relative to the purpose of at least one person.  Machines don't think.  This is Dr. Scott's Rule of Data #1.

It is difficult but useful to distinguish between data for machine processing and data for user presentation.  The key factor is the situation of the person who understands the data and provides the meaning relative to the purpose.  For machine processing, that person is the programmer, who understands the data specification but never sees the actual data at runtime.  I have slides for this concept in various NIEM briefings. [http://dodccrp.org/events/19th_iccrts_2014/post_conference/presentations/129.pdf]

Some data is for both machine processing and user presentation.  Consider the street address.  One might think this is always for user presentation, but not so.  Machine processing reads that address and calculates carrier routes for the post office, allocates packages and does route planning for trucks, etc.

For some kinds of machine processing it is usually better to use enumerations instead of strings when feasible.  For example, if the data drives a case statement, like this

	switch (country_code) {
	case CAN:	/* enumerated constant, not the string "CAN" */
		printf("I think you mean North Montana :-)\n");
		break;
	}

Database index fields are another place where you would rather have enumerations than strings.  Short messages over a bandwidth-constrained network are yet another.  But there are always exceptions.  Difficulties in configuration management sometimes mean you must allow a string value in addition to your enumeration / controlled vocabulary.  Some data elements cannot be feasibly enumerated, like National Stock Number.  And so forth...
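
One way the "enumeration plus a string" compromise might look in C is sketched below.  This is only an illustration under assumptions: the three-country vocabulary and the names country_code, country_value, COUNTRY_OTHER and parse_country are hypothetical, not taken from any specification.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical controlled vocabulary; COUNTRY_OTHER is the escape hatch
   for values the enumeration does not (yet) cover. */
typedef enum { COUNTRY_CAN, COUNTRY_MEX, COUNTRY_USA, COUNTRY_OTHER } country_code;

typedef struct {
    country_code code;
    char other[16];   /* raw text, meaningful only when code == COUNTRY_OTHER */
} country_value;

/* Order must match the enumerators above. */
static const char *const known[] = { "CAN", "MEX", "USA" };

/* Map incoming text onto the enumeration; keep unknown values as strings
   (truncated to fit the buffer) instead of rejecting them. */
country_value parse_country(const char *text)
{
    country_value v = { COUNTRY_OTHER, "" };
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(text, known[i]) == 0) {
            v.code = (country_code)i;
            return v;
        }
    }
    snprintf(v.other, sizeof v.other, "%s", text);
    return v;
}
```

Code that switches on v.code gets the benefits of the enumeration for the common cases, and the default branch can still log or pass along v.other when the vocabulary lags behind reality.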

So I think that properly qualified, a variant of Roger's rule might make sense, something like:  For machine-processed data, an enumeration / controlled vocabulary is usually preferable to an unconstrained string, when feasible.  Feasibility is high when the set of terms is small and relatively stable.  Feasibility is low when the set is large and fast-changing.  An enumeration of state codes is feasible; an enumeration of book titles, um... not so much.

-- scott

Dr. Scott Renner, The MITRE Corporation
+1 703-983-1206 (office); +1 978-831-2598 (cell)

