----- Original Message -----
From: "Elliotte Rusty Harold" <firstname.lastname@example.org>
Sent: Tuesday, January 15, 2002 9:10 PM
Subject: RE: [xml-dev] Xml is _not_ self describing
>I guess it depends on what exactly you mean by "self-describing". I
>think a book about the English language written in English is
>self-describing in and of itself, whether anybody speaks English or
I agree with you. The RELAX NG schema for RELAX NG is also self-describing.
But if the schema is all you have to read, you will spend many, many hours
before you begin to work out what the document means, and I'm speaking
of human intelligence here.
Maybe the term "self-describing" should be made more precise by
specifying the intended audience and purpose of the self-description.
The -ing form is tricky: "self-describing" seems to mean that the data by
itself can reify its meaning.
- An XML document without any related DTD is not self-describing. It merely
transmits data about a labeled tree; there is no meta-data available. You can
check its well-formedness, but for that you just apply external
well-formedness rules to the document.
- An XML document with an embedded DTD is self-describing, for computers that
know about XML and DTDs, and for validation purposes. The document itself
provides information on how it has to be processed to be declared valid in
its own sense.
- However, outside the bounds of very precise algorithms (validation), an
XML document with an embedded DTD is not self-describing for computers in a
more general processing context. Nothing tells the computer about how the
data should be processed. The document has no control over its own fate. An
invoice document does not describe how it should be processed by an
accounting system. That information comes from elsewhere.
The last point means that the hype 'because XML is self-describing, it is
the lingua franca of computer science, and your integration costs will drop'
is pure bullshit. We know it for sure on this list, but explaining why needs
a precise definition of what 'self-describing' means...
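
To make the first and third cases concrete, here is a minimal sketch in
Python using only the standard library. The invoice document and its tag
names are invented for illustration. It shows that checking well-formedness
means applying rules that live in the parser, not in the document, and that
what comes out is only a labeled tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice; the tag and attribute names are made up for this sketch.
INVOICE = (
    "<invoice><customer>ACME</customer>"
    "<total currency='EUR'>120.50</total></invoice>"
)


def is_well_formed(text):
    """Apply the external well-formedness rules by running a parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def labeled_tree(element):
    """Return the element as a nested (tag, attributes, text, children) tuple."""
    return (
        element.tag,
        dict(element.attrib),
        (element.text or "").strip(),
        [labeled_tree(child) for child in element],
    )


tree = labeled_tree(ET.fromstring(INVOICE))
print(is_well_formed(INVOICE))                       # a complete document
print(is_well_formed("<invoice><total></invoice>"))  # mismatched tags
print(tree)
```

The tuple printed at the end is everything the document carries: labels,
attributes, text, and nesting. Whether "total" should be posted to a ledger,
converted to another currency, or ignored is decided by the accounting code
that receives the tree.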
> When a document is marked up, the information of the markup is there,
> whether we recognize it or not. It is a property of the text itself,
> not a property of our perception of the text. With appropriate work,
> experience, intelligence, and luck that markup can be understood. Can
> unmarked up text be understood as well? Yes, certainly; but markup
> adds to the information content of the text. It makes it easier to
> decipher its meaning in a very practically useful way. This is a
> question of degree, and text+markup is easier to understand than text
By carefully examining the data in a CSV file without column headers and
applying clever heuristics, you can often find out what each column means
(especially if you spot zip codes, city names that you know, etc.). And
again, the fact that a CSV file is, well, comma separated, makes it easier
to parse and use than the equivalent plain-text file. Formatting rules and
markup sure *do* add information if they are used consistently.
However, I don't think it is sensible to say that an XML file with unknown
or foreign tag names is more interesting than a CSV file without headers.
You get more information, because provided that you notice the pointy
brackets and find out that some series of characters surrounded by <> or </>
match, you can build a hierarchical model. But more information does not
mean more meaning. There is no magic thing in XML that will give you the
*meaning* of the hierarchical relation, or of the data embedded inside the
tags, contrary to what the public may believe when hearing the term
"self-describing". That was the point of this "Xml is _not_ self-describing"
thread: beware of the magic connotations of "self-describing".
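
As a rough illustration of the comparison above, the following Python sketch
parses the same made-up records twice: once as a headerless CSV file and once
as XML with deliberately opaque tag names. The data and the tag names (a, b,
c, d, e) are assumptions chosen for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The same two invented records, once as headerless CSV and once as XML
# whose tag names tell the reader nothing.
CSV_DATA = "75001,Paris,42\n69001,Lyon,17\n"
XML_DATA = (
    "<a><b><c>75001</c><d>Paris</d><e>42</e></b>"
    "<b><c>69001</c><d>Lyon</d><e>17</e></b></a>"
)

# CSV gives flat rows of strings.
rows = list(csv.reader(io.StringIO(CSV_DATA)))


def paths(element, prefix=""):
    """Collect the path of every element, which exposes the hierarchy."""
    path = f"{prefix}/{element.tag}"
    found = [path]
    for child in element:
        found.extend(paths(child, path))
    return found


# XML additionally gives a hierarchical model of the same values.
structure = sorted(set(paths(ET.fromstring(XML_DATA))))

print(rows)
print(structure)
```

The XML version yields extra structure (every c, d and e sits under a b,
which sits under a), but deciding that c is a zip code or that e is a count
still takes the same outside knowledge and heuristics as reading the third
CSV column.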