I have a slightly different take on the distinction between "structured" 
and "unstructured" (and the less well understood "semi-structured").

I agree that SQL data is well structured, not because its intended meaning 
is unambiguous (hah! you should see some of the databases...but that's 
another rant), but because every piece of information is "there".  SQL, of 
course, represents data as rectangular structures called tables.  A table 
has a particular number of columns and holds rows of data, each row having 
exactly one value for each column.  SQL doesn't use the word "cell", but 
it's a convenient term for this discussion.  Every cell in every SQL table 
has a value.  That value 
might be SQL's "null value", but the cell is always "there".
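
To make the "always there" point concrete, here is a minimal sketch in 
Python using the standard sqlite3 module.  The table name "person" and its 
three columns are made up for illustration; the only point is that every 
row comes back with one slot per column, even when nothing was supplied.

```python
import sqlite3

# Hypothetical table, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, email TEXT, phone TEXT)")

# Each insert supplies only two of the three columns.
conn.execute("INSERT INTO person (name, email) VALUES ('Ann', 'ann@example.org')")
conn.execute("INSERT INTO person (name, phone) VALUES ('Raj', '555-0100')")

rows = conn.execute(
    "SELECT name, email, phone FROM person ORDER BY name"
).fetchall()

for row in rows:
    # Every row has exactly three cells; the unsupplied one holds NULL,
    # which sqlite3 reports as None.
    print(len(row), row)
```

Both rows have three cells.  The missing email or phone is a null value 
sitting in a cell, not a cell that fails to exist.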

Unstructured data is...well, unstructured.  A decent example is the text of 
this email message.  You might perceive structure, such as paragraphs and 
sentences, but those are artifacts of my use of common English/Western 
conventions, not actual structure.  And, most importantly, there is no 
single "thing" that you can identify that is required, optional, or 
prohibited in this message.  There is no structure at all.

HTML, and (more importantly to many) XML, are semi-structured by nature, 
although it is certainly possible to force specific scenarios using those 
markup languages to be fully structured (by requiring validation against a 
DTD or Schema that makes everything mandatory, for example).  To me, 
"semi-structured" means that there is structure there, but it is not 
completely reliable.  Information may be missing entirely.  I don't mean 
present but marked as "unknown" or "missing" or "irrelevant" (analogous to 
some meanings of SQL's null value); I mean completely absent.
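
As a rough illustration of the difference, here is a small Python sketch 
using the standard xml.etree.ElementTree module.  The document and its 
element names ("people", "person", "email") are invented for this example, 
and no DTD or Schema is assumed, so nothing forces any element to appear.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: three people, each with a different amount of
# information about their email address.
doc = """
<people>
  <person><name>Ann</name><email>ann@example.org</email></person>
  <person><name>Raj</name><email/></person>
  <person><name>Lee</name></person>
</people>
"""

root = ET.fromstring(doc)
status = {}
for person in root.findall("person"):
    name = person.findtext("name")
    email = person.find("email")
    if email is None:
        # No <email> element at all: the information is simply absent.
        status[name] = "absent"
    elif not (email.text or "").strip():
        # The element exists but is empty, roughly like a null cell.
        status[name] = "present but empty"
    else:
        status[name] = "present"

print(status)
```

The second person's empty element is roughly the counterpart of a null 
cell.  The third person's record has no counterpart in a SQL table: there 
is no element to ask about, so the code has to check for its existence 
before it can check for a value.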

I could not, in good conscience, call HTML "structured" by any stretch of 
the imagination.  But it is certainly not unstructured, either.  I must fall 
back on that hybrid concept with the name "semi-structured".

Hope this helps,

At 8/9/2005 09:35 AM, ian.graham@utoronto.ca wrote:
>Quoting "DuCharme, Bob (LNG-CHO)" <bob.ducharme@lexisnexis.com>:
>Yes +1
>OTOH, I've seen stuff so horrible on both counts it arguably should be "No"
> > >Is HTML structured or unstructured information?
> >
> > Yes!
> >
> > But seriously... if "Structured information may be characterized as
> > information whose intended meaning is unambiguous" and "The canonical
> > example of structured information is a relational database table" then the
> > article is building from a shaky premise, because the intended meaning of
> > the data in a relational database table can easily be ambiguous.
> >
> > If it means that a relational table is structured because the individual
> > pieces of information in it are clearly delineated and their structural
> > relation is unambiguous, which makes sense to me, then I would consider
> > HTML structured, especially when compared to the article's examples of
> > unstructured information.
> >
> > Bob
> > weblog: http://www.oreillynet.com/pub/au/1191
> > homepage: http://www.snee.com/bob
