OASIS Mailing List Archives





   RE: [xml-dev] Xml is _not_ selfdescribing


>-----Original Message-----
>From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
>Sent: Tuesday, 15 January 2002 16:17
>To: xml-dev@lists.xml.org
>Subject: Re: [xml-dev] Xml is _not_ selfdescribing
>At 2:52 PM +0100 1/15/02, Jens Jakob Andersen, PDI wrote:
>>Hello all
>>I think that it is fair to conclude now, that XML is _not_ any more 
>>selfdescribing than e.g. CSV files.
>That's ridiculous. XML absolutely is more self-describing than CSV. 
>Nothing here has proven otherwise.  Your claim is indicative of the 
>flawed binary logic that pervades much of the Internet. XML is not 
>perfectly self-describing. Therefore it is not self-describing. But 
>that's only a syllogism in binary logic. The real world isn't binary. 
>It's fuzzy. There are degrees of things, including degrees of
>self-description.
>No serious analysis of how XML is actually used vs. how CSV files are 
>actually used could possibly deny that XML is more self describing. 
>The possibility that XML tag names could be chosen randomly does not 
>evade the fact that they are not chosen randomly in the vast majority 
>of cases. The evidence that some (though far from all) XML 
>applications use extremely opaque tag names does not imply that there 
>is no meaning there, or that this meaning cannot be teased out of an 
>XML document by a sufficiently determined researcher. The need for 
>genuine intelligence to comprehend and make use of this meaning does 
>not make it useless.
>In reverse, the possibility of using column names in CSV files does 
>not help in any way with the large proportion of CSV files that don't 
>use column names. That the rows of a CSV file can match the column 
>names doesn't help at all when they don't. In the real world, XML is 
>simply easier to work with than CSV.

Once again, the problem here is very subtle. Tag names do improve the
self-description of XML data, in the same way that CSV column names do.

If we consider XML a way to serialise labelled trees, the simplest readable
equivalent to column names in a serialised representation of a labelled tree
is to write schema element names in place, like XML tag names or YAML
element names. Separating data from meta-data in a header-and-body fashion
(a la CSV) could be possible in some cases, but would not be readable at all
(try to picture it).
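
To help picture it, here is a minimal Python sketch that splits a small
labelled tree into a CSV-style header line and body line. The sample
document, its tag names and the helper `leaf_paths` are invented for this
illustration, and using slash-separated paths as column names is only one
possible convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
XML_TEXT = (
    "<order>"
    "<customer><name>Alice</name><city>Paris</city></customer>"
    "<item><sku>A1</sku><qty>2</qty></item>"
    "</order>"
)

def leaf_paths(element, prefix=""):
    """Yield (path, text) for every leaf element, in document order."""
    path = f"{prefix}/{element.tag}"
    children = list(element)
    if not children:
        yield path, element.text or ""
    for child in children:
        yield from leaf_paths(child, path)

pairs = list(leaf_paths(ET.fromstring(XML_TEXT)))

# "Header": all the labels, hoisted out of the data.
header = ",".join(path for path, _ in pairs)
# "Body": the bare values, in the same order.
body = ",".join(value for _, value in pairs)

print(header)
print(body)
```

The header line already repeats `/order/customer` and `/order/item` for each
leaf, and that is with a single record of four values; repeated or optional
elements would make the header harder still to read.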

The subtle thing here is that, apart from the hierarchical vs. flat structure
difference, there is no semantic "leap" from CSV to XML. Tag names are
basically the same thing as column headers. Therefore, there is, I insist,
no more self-description of data in XML documents than in CSV files.
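
As an illustration, the following Python sketch reads a CSV file with a
header row and a flat XML document into the same list of dictionaries. The
sample data and the labels `name` and `city` are invented for the example.
In both readers the labels are just strings used as dictionary keys; neither
one "understands" them.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this illustration.
CSV_TEXT = "name,city\nAlice,Paris\nBob,Oslo\n"
XML_TEXT = (
    "<people>"
    "<person><name>Alice</name><city>Paris</city></person>"
    "<person><name>Bob</name><city>Oslo</city></person>"
    "</people>"
)

def rows_from_csv(text):
    # The header row supplies the labels for every later row.
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def rows_from_xml(text):
    # Each child tag name supplies the label for its own value.
    root = ET.fromstring(text)
    return [{field.tag: field.text for field in record} for record in root]

csv_rows = rows_from_csv(CSV_TEXT)
xml_rows = rows_from_xml(XML_TEXT)

# Both formats yield the same labelled records.
print(csv_rows == xml_rows)
```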

You point out the fact that lots of CSV files don't have column headers. Fine.
Then let's just create a CSV++ specification that enforces column headers. Et
voilà! We get a so-called "self-describing" CSV format.
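
A reader for such a format could look roughly like the Python sketch below.
No CSV++ specification exists, so the rules enforced here (a mandatory first
row of non-empty, unique names, and the same number of fields in every row)
and the function name `read_csvpp` are assumptions made up for this example.

```python
import csv
import io

def read_csvpp(text):
    """Parse hypothetical "CSV++" text into a list of labelled records."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if not header:
        raise ValueError("CSV++ input must start with a header row")
    if any(not name.strip() for name in header) or len(set(header)) != len(header):
        raise ValueError("header names must be non-empty and unique")

    records = []
    for row_no, row in enumerate(reader, start=2):
        # Every data row must line up with the header.
        if len(row) != len(header):
            raise ValueError(
                f"row {row_no} has {len(row)} fields, expected {len(header)}"
            )
        records.append(dict(zip(header, row)))
    return records

print(read_csvpp("name,city\nAlice,Paris\n"))
```

Note that the reader checks only the shape of the header; what the names
mean is still left to whoever writes the program that consumes the records.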

As Bill de Hora says, there is no magical means by which a program can
understand XML data better than CSV data. The opposite claim, however, has
been heard often enough to justify our reacting against it.

I think that for most people on this list, "self-description" just means
that, because the meta-data is embedded with the data, a human reader does not
need to refer to separate documentation or a schema to find out what the
data means. However, the fact that the meta-data is embedded does not change
the nature and meaning of the data for a computer. Data remains basically
data, and a human mind is required to interpret it and write programs that
manipulate it.

For newspapers or IT managers, however, "self-description" is a term that
implies knowledge and intelligence added to the data, meaning that a computer
program could use the self-description (i.e. the meta-data) to learn from and
adapt to updated or unknown document types. Mentioning "self-description" makes
the innocent reader believe that programs will use this self-description, which
is not true (as mentioned earlier, very few programs process data at
the meta-data level).

This is a corollary of the "XML is the Lingua Franca of IT" claim (whereas XML
should only be considered one of its alphabets). This concept of
"self-description" spawned the idea that programs could make sense of any
kind of XML document, hence the "XML as the next computing revolution" hype.
And sooner or later, we'll have to pay the price of arrogance, as more and
more people find out that XML is "just another data format". This could
be the root of a severe backlash that would throw the baby out with the
bathwater and benefit the next "computing revolution".


P.S. Instead of CSV, you can read "any kind of tabular data format precise
enough not to worry us with character sets, character escaping, etc.". As
Mark Seaborne pointed out, CSV is hardly a format.



Copyright 2001 XML.org. This site is hosted by OASIS