Let me further this off-topic discussion by pasting my definition of
semi-structured and unstructured from my PhD thesis (no
self-aggrandizing going on here... it just so happens that I was working in
this area). HTML is disqualified from being structured under my definition:
"In our research we have focused on documents that contain structured
or semi-structured data. That means that information in these documents
is sufficiently grouped, or structured, to enable us to identify fragments
of information and refer to them. This type of information encompasses
the usual “text” forms of semi-structured, marked up documents like XML
[Bray et al., 2000], database tables, source code files, and so on.
Another important characteristic of the types of documents we are
interested in is that they are commonly used to store information that
corresponds to some model, be it implicit or expressed formally:
* A Java source code file follows the Java grammar, as specified in BNF form.
* A product catalogue corresponds to the business model of what makes a product
and a catalogue.
* An XMI [OMG, 2000b] file usually conforms to the UML
[Object Management Group, 1997] meta-model, or some other appropriate
The clearly identifiable parts of these documents have something in common:
their data items carry static semantics in the application domain. We want
to check that these semantics actually hold."
> I have a slightly different take on the distinction between "structured"
> and "unstructured" (and the less-well understood "semi-structured").
> I agree that SQL data is well structured, not because its intended meaning
> is unambiguous (hah! you should see some of the databases...but that's
> another rant), but because every piece of information is "there". SQL, of
> course, represents data as rectangular structures called tables. A table
> is a structure, having a particular number of columns, in which there are
> rows of data, each having exactly one value corresponding to each column of
> the table. SQL doesn't use the word "cell", but it's convenient to use in
> this discussion. Every cell in every SQL table has a value. That value
> might be SQL's "null value", but the cell is always "there".
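
(An aside on the paragraph above: a minimal sketch of the "every cell is there" point, using Python's standard sqlite3 module. The table and column names are hypothetical, chosen only for illustration. A row inserted without a price still comes back with one value per column; the missing one is simply the null value.)

```python
import sqlite3

# Hypothetical table and column names, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER, name TEXT, price REAL)")

# Insert a row without supplying a price.
conn.execute("INSERT INTO product (id, name) VALUES (1, 'widget')")

row = conn.execute("SELECT id, name, price FROM product").fetchone()

# The row still has exactly one value per column; the omitted price
# is SQL's null value, which sqlite3 returns as None.
print(len(row), row)
```
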
> Unstructured data is...well, unstructured. A decent example is the text of
> this email message. You might perceive structure, such as paragraphs and
> sentences, but those are artifacts of my use of common English/Western
> conventions, not actual structure. And, most importantly, there is no
> single "thing" that you can identify that is required, optional, or
> prohibited in this message. There is no structure at all.
> HTML, and (more importantly to many) XML, are semi-structured by nature,
> although it is certainly possible to force specific scenarios using those
> markup languages to be fully structured (by requiring validation against a
> DTD or Schema that makes everything mandatory, for example). To me,
> "semi-structured" means that there is structure there, but it is not
> completely reliable. Information may be missing entirely: not "present but
> marked as unknown, missing, or irrelevant" (analogous to some meanings
> for SQL's null value), but completely absent.
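
(Another aside: a minimal sketch of "completely absent" in XML, using Python's standard xml.etree.ElementTree module. The catalogue document and its element names are hypothetical. The second product has no price element at all, so a lookup finds nothing rather than a placeholder value.)

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue; element names are invented for this illustration.
doc = ET.fromstring(
    "<catalogue>"
    "<product><name>widget</name><price>9.99</price></product>"
    "<product><name>gadget</name></product>"
    "</catalogue>"
)

# find() returns None when the child element is entirely absent,
# which is different from an element that is present but empty.
prices = [p.find("price") for p in doc.findall("product")]
has_price = [e is not None for e in prices]
print(has_price)
```
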
> I could not, in good conscience, call HTML "structured" by any stretch of
> the meaning. But it is certainly not unstructured, either. I must fall
> back on that hybrid concept with the name "semi-structured".
> Hope this helps,
> At 8/9/2005 09:35 AM, firstname.lastname@example.org wrote:
>>Quoting "DuCharme, Bob (LNG-CHO)" <email@example.com>:
>>OTOH, I've seen stuff so horrible on both counts it arguably should be "No"
>> > >Is HTML structured or unstructured information?
>> > Yes!
>> > But seriously... if "Structured information may be characterized as
>> > information whose intended meaning is unambiguous" and "The canonical
>> > example of structured information is a relational database table" then the
>> > article is building from a shaky premise, because the intended meaning of
>> > the data in a relational database table can easily be ambiguous.
>> > If it means that a relational table is structured because the individual
>> > pieces of information in it are clearly delineated and their structural
>> > relation is unambiguous, which makes sense to me, then I would consider
>> > HTML structured, especially when compared to the article's examples of
>> > unstructured information.
>> > Bob
>> > weblog: http://www.oreillynet.com/pub/au/1191
>> > homepage: http://www.snee.com/bob
> Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144
> Co-Chair, W3C XML Query WG; F&O (etc.) editor Fax : +1.801.942.3345
> Oracle Corporation Oracle Email: jim dot melton at oracle dot com
> 1930 Viscounti Drive Standards email: jim dot melton at acm dot org
> Sandy, UT 84093-1063 USA Personal email: jim at melton dot name
> = Facts are facts. But any opinions expressed are the opinions
> = only of myself and may or may not reflect the opinions of anybody =
> = else with whom I may or may not have discussed the issues at hand. =
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
> The list archives are at http://lists.xml.org/archives/xml-dev/
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://www.oasis-open.org/mlmanage/index.php>