I have a slightly different take on the distinction between "structured"
and "unstructured" (and the less well understood "semi-structured").
I agree that SQL data is well structured, not because its intended meaning
is unambiguous (hah! you should see some of the databases...but that's
another rant), but because every piece of information is "there". SQL, of
course, represents data as rectangular structures called tables. A table
has a fixed number of columns and any number of rows, and each row has
exactly one value for each column of the table. SQL doesn't use the word
"cell", but it's a convenient term for this discussion. Every cell in every
SQL table has a value. That value might be SQL's "null value", but the cell
is always "there".
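To make the "always there" point concrete, here is a minimal sketch using
Python's built-in sqlite3 module. The table and column names (person, name,
phone) are hypothetical, made up only for illustration. A row inserted
without a phone number still comes back with two cells, one of them null.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, phone TEXT)")

# Only "name" is supplied here; the "phone" cell still exists and holds NULL.
conn.execute("INSERT INTO person (name) VALUES ('Ada')")
conn.execute("INSERT INTO person (name, phone) VALUES ('Bob', '555-0100')")

rows = conn.execute("SELECT name, phone FROM person ORDER BY name").fetchall()
for row in rows:
    # Every row has exactly one value per column; NULL shows up as None.
    print(len(row), row)
```

Both rows have length 2. The missing phone number is reported as None
(Python's rendering of the null value), not as a shorter row.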
Unstructured data is...well, unstructured. A decent example is the text of
this email message. You might perceive structure, such as paragraphs and
sentences, but those are artifacts of my use of common English/Western
conventions, not actual structure. And, most importantly, there is no
single "thing" that you can identify that is required, optional, or
prohibited in this message. There is no structure at all.
HTML, and (more importantly to many) XML, are semi-structured by nature,
although it is certainly possible to force specific scenarios using those
markup languages to be fully structured (by requiring validation against a
DTD or Schema that makes everything mandatory, for example). To me,
"semi-structured" means that there is structure there, but it is not
completely reliable. Information may be missing entirely: not present and
marked as "unknown" or "missing" or "irrelevant" (analogous to some
meanings for SQL's null value), but completely absent.
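As a rough illustration of that difference, the sketch below uses Python's
standard xml.etree.ElementTree on a made-up document (the element names
people, person, name and phone are assumptions, not taken from any real
schema). One person has a phone number, one has an empty phone element,
and one has no phone element at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; no DTD or Schema forces <phone> to appear.
doc = """
<people>
  <person><name>Ada</name><phone>555-0100</phone></person>
  <person><name>Bob</name><phone/></person>
  <person><name>Cy</name></person>
</people>
"""

root = ET.fromstring(doc)
status = {}
for person in root.findall("person"):
    name = person.findtext("name")
    phone = person.find("phone")  # None when the element is not there at all
    if phone is None:
        status[name] = "absent"
    elif not (phone.text or "").strip():
        status[name] = "present but empty"
    else:
        status[name] = "present"
    print(name, status[name])
```

The empty element is the closest analogue to a null cell: the slot exists
but carries no value. The third case has no SQL counterpart, because the
slot itself is missing, and code that reads the document has to check for
that separately.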
I could not, in good conscience, call HTML "structured" by any stretch of
the term. But it is certainly not unstructured, either. I must fall
back on that hybrid concept with the name "semi-structured".
Hope this helps,
At 8/9/2005 09:35 AM, email@example.com wrote:
>Quoting "DuCharme, Bob (LNG-CHO)" <firstname.lastname@example.org>:
>OTOH, I've seen stuff so horrible on both counts it arguably should be "No"
> > >Is HTML structured or unstructured information?
> > Yes!
> > But seriously... if "Structured information may be characterized as
> > information whose intended meaning is unambiguous" and "The canonical
> > example of structured information is a relational database table" then the
> > article is building from a shaky premise, because the intended meaning of
> > the data in a relational database table can easily be ambiguous.
> > If it means that a relational table is structured because the individual
> > pieces of information in it are clearly delineated and their structural
> > relation is unambiguous, which makes sense to me, then I would consider
> > HTML structured, especially when compared to the article's examples of
> > unstructured information.
> > Bob
> > weblog: http://www.oreillynet.com/pub/au/1191
> > homepage: http://www.snee.com/bob
>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
>initiative of OASIS <http://www.oasis-open.org>
>The list archives are at http://lists.xml.org/archives/xml-dev/
>To subscribe or unsubscribe from this list use the subscription
Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144
Co-Chair, W3C XML Query WG; F&O (etc.) editor Fax : +1.801.942.3345
Oracle Corporation Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA Personal email: jim at melton dot name
= Facts are facts. But any opinions expressed are the opinions =
= only of myself and may or may not reflect the opinions of anybody =
= else with whom I may or may not have discussed the issues at hand. =