   RE: [xml-dev] Is HTML structured or unstructured information?

I have a slightly different take on the distinction between "structured" 
and "unstructured" (and the less well understood "semi-structured").

I agree that SQL data is well structured, not because its intended meaning 
is unambiguous (hah! you should see some of the databases...but that's 
another rant), but because every piece of information is "there".  SQL, of 
course, represents data as rectangular structures called tables.  A table 
is a structure, having a particular number of columns, in which there are 
rows of data, each having exactly one value corresponding to each column of 
the table.  SQL doesn't use the word "cell", but it's convenient to use in 
this discussion.  Every cell in every SQL table has a value.  That value 
might be SQL's "null value", but the cell is always "there".
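
To make the "every cell is there" point concrete, here is a minimal sketch 
using Python's built-in sqlite3 module.  The table and column names (person, 
name, phone) are invented for illustration, and SQLite is only a stand-in for 
any SQL engine.  A row inserted without a phone still has a phone cell; it 
simply holds the null value, which Python reports as None.

```python
import sqlite3

def fetch_rows():
    # In-memory database; "person", "name" and "phone" are hypothetical names.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (name TEXT, phone TEXT)")
    conn.execute("INSERT INTO person (name, phone) VALUES ('Ann', '555-0100')")
    # No phone supplied: the cell still exists and holds SQL's null value.
    conn.execute("INSERT INTO person (name) VALUES ('Bob')")
    rows = conn.execute(
        "SELECT name, phone FROM person ORDER BY name"
    ).fetchall()
    conn.close()
    return rows

rows = fetch_rows()
for row in rows:
    # Every row has exactly one value per column, even if that value is None.
    print(len(row), row)
```

Both rows come back with two values each; nothing about the shape of the 
table depends on whether the phone was known.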

Unstructured data is...well, unstructured.  A decent example is the text of 
this email message.  You might perceive structure, such as paragraphs and 
sentences, but those are artifacts of my use of common English/Western 
conventions, not actual structure.  And, most importantly, there is no 
single "thing" that you can identify that is required, optional, or 
prohibited in this message.  There is no structure at all.

HTML, and (more importantly to many) XML, are semi-structured by nature, 
although it is certainly possible to force specific scenarios using those 
markup languages to be fully structured (by requiring validation against a 
DTD or Schema that makes everything mandatory, for example).  To me, 
"semi-structured" means that there is structure there, but it is not 
completely reliable.  Information may be missing entirely: not present but 
marked as "unknown" or "missing" or "irrelevant" (analogous to some 
meanings for SQL's null value), but completely absent.
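
The difference between "present but empty" and "completely absent" can be 
sketched with Python's standard xml.etree.ElementTree module.  The document 
below and its element names (people, person, name, phone) are made up for 
this example, and no DTD or Schema is applied, so nothing forces a phone 
element to appear at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: Ann has a phone, Bob has an empty phone element,
# and Cy has no phone element at all.
DOC = """
<people>
  <person><name>Ann</name><phone>555-0100</phone></person>
  <person><name>Bob</name><phone/></person>
  <person><name>Cy</name></person>
</people>
"""

def phone_status(person):
    phone = person.find("phone")
    if phone is None:
        # The element is not in the document: completely absent.
        return "absent"
    if not (phone.text or "").strip():
        # The element is there but carries no value.
        return "present but empty"
    return phone.text.strip()

root = ET.fromstring(DOC)
statuses = {p.findtext("name"): phone_status(p) for p in root.findall("person")}
print(statuses)
```

A consumer of such a document has to handle the third case explicitly, which 
a SQL table never asks of its readers.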

I could not, in good conscience, call HTML "structured" by any stretch of 
the imagination.  But it is certainly not unstructured, either.  I must fall 
back on that hybrid concept with the name "semi-structured".

Hope this helps,

At 8/9/2005 09:35 AM, ian.graham@utoronto.ca wrote:
>Quoting "DuCharme, Bob (LNG-CHO)" <bob.ducharme@lexisnexis.com>:
>Yes +1
>OTOH, I've seen stuff so horrible on both counts it arguably should be "No"
> > >Is HTML structured or unstructured information?
> >
> > Yes!
> >
> > But seriously... if "Structured information may be characterized as
> > information whose intended meaning is unambiguous" and "The canonical
> > example of structured information is a relational database table" then the
> > article is building from a shaky premise, because the intended meaning of
> > the data in a relational database table can easily be ambiguous.
> >
> > If it means that a relational table is structured because the individual
> > pieces of information in it are clearly delineated and their structural
> > relation is unambiguous, which makes sense to me, then I would consider
> > HTML
> > structured, especially when compared to the article's examples of
> > unstructured information.
> >
> > Bob
> > weblog: http://www.oreillynet.com/pub/au/1191
> > homepage: http://www.snee.com/bob
>The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
>initiative of OASIS <http://www.oasis-open.org>
>The list archives are at http://lists.xml.org/archives/xml-dev/
>To subscribe or unsubscribe from this list use the subscription
>manager: <http://www.oasis-open.org/mlmanage/index.php>

Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
   Co-Chair, W3C XML Query WG; F&O (etc.) editor    Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA          Personal email: jim at melton dot name
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
