OASIS Mailing List Archives





   Is marking up a classification act?



Another sequel to the recent discussions about ontology is an inference I
made about the act of marking up documents.

Are there any issues raised by this statement:

"Marking up documents is, in fact, classifying information".

If I send you an XML document, I can attach its related schema. This will
give you its syntactic constraints. However, what is missing is the mental
model behind the classification I used. Said differently, what is missing is
an answer to the question "what do you mean by...?". Thus, marking up a
document is also expressing a view of the world, and this view is based on a
mental model, logic and theories about the world.

Note about automatic classification:
When an HTML or XHTML document is marked up, the markup gives me clues about
which parts are headers and which are paragraphs. If I am a classification
engine trying to discover other "tacit" views of the world expressed by this
very document, I can give more weight to text contained in headers than to
text contained in paragraphs. By convention, a header is supposed to
summarize the text that follows it and give, in a nutshell, its essence. If,
in addition, the paragraph contains other tagged text, I can extract
additional information about the text. However, some issues may be raised
here.
a) My view of the world is wrong. My marked up text is then badly
classified, and this leads to classification errors.
b) I simply made a mistake. Again, same result as above. Just consider the
number of errors an average programmer makes when writing a program.
Programmers are lucky that compilers help them catch these errors. What
about natural language: what kind of compiler can help us prevent errors
there?
c) The classification is fuzzy. The tagged item belongs 40% to one
particular set (i.e. category), 20% to a different set and 40% to yet
another. A human can usually resolve that classification ambiguity (although
some can't). Can HAL resolve that? (We all know the result demonstrated in
the movie.) Usually the ownership is resolved by the overall context.
d) The task is so time consuming and error prone that, outside of a pleasant
intellectual game played with the intent to learn something, I don't think I
would do it for the other documents I write.
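
To make the header-weighting idea in the note above concrete, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the weights (3.0 for headers, 1.0 for everything else) are
arbitrary values picked for the example, and the names WeightedTermCounter
and score_terms are hypothetical. It does nothing about issues a) to d); it
trusts the markup blindly, which is exactly the weakness discussed above.

```python
import re
from collections import Counter
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
HEADER_WEIGHT = 3.0     # assumed value, chosen only for illustration
PARAGRAPH_WEIGHT = 1.0  # assumed value for all text outside headers


class WeightedTermCounter(HTMLParser):
    """Accumulate a score per word, weighting header text more heavily."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self._header_depth = 0  # > 0 while inside an h1..h6 element

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._header_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self._header_depth > 0:
            self._header_depth -= 1

    def handle_data(self, data):
        weight = HEADER_WEIGHT if self._header_depth else PARAGRAPH_WEIGHT
        for word in re.findall(r"[a-z]+", data.lower()):
            self.scores[word] += weight


def score_terms(html):
    """Return a Counter mapping each word to its weighted score."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.scores


doc = (
    "<h1>Markup as classification</h1>"
    "<p>Markup tags classify text. A schema constrains markup syntax.</p>"
)
scores = score_terms(doc)
for word, score in scores.most_common(3):
    print(word, score)
```

With these weights, a word that appears once in a header outranks a word
that appears once or twice in a paragraph. If the author tagged the wrong
text as a header (issues a and b), the engine inherits that error.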

From the engineering point of view, I can design a language that is based on
solid mathematical foundations. However, in practice, when I try to build a
document that provides some information about the view of the world behind
it, it is not that easy. I guess this is why people don't do it and instead
let automatic agents like search engines classify their documents. My
neighbor is now reassured: the planet of the computers, the Matrix or AI are
not for tomorrow, since we have not yet found a way to teach machines some
common sense :-)

Didier PH Martin



Copyright 2001 XML.org. This site is hosted by OASIS