   Re: [xml-dev] An alternative formulation of the document-centric/data-ce


At 11:01 AM 6/3/2004 +0100, Sean McGrath wrote:
Document-centric XML:
        XML in which corpora conforming to schema X exhibit power-law distributions of the element types in X.

Data-centric XML:
        XML in which corpora conforming to schema X exhibit uniform distributions of the element types in X.

Not perfect but useful nonetheless I think. Mixed content is missing for a start.

Anyway, please take a look at the graphs at:

I'd be very interested in seeing other people's graphs of the tag-share of their XML corpora.
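
For anyone who wants to produce such a graph from their own corpus, here is a minimal sketch of one way to compute tag-share in Python. It is not Sean's method: the function name tag_share and the toy corpus are made up for illustration, and "tag-share" is assumed here to mean each element type's fraction of all element occurrences in the corpus.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_share(xml_documents):
    """Return (element type, share) pairs, most frequent first.

    Assumption: "share" is the element type's count divided by the
    total number of element occurrences across the whole corpus.
    """
    counts = Counter()
    for doc in xml_documents:
        root = ET.fromstring(doc)
        # iter() walks the root element and all of its descendants.
        for elem in root.iter():
            counts[elem.tag] += 1
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]


# Hypothetical two-document corpus, for illustration only.
corpus = [
    "<doc><p>a<em>b</em></p><p>c</p><p>d</p></doc>",
    "<doc><p>e</p><note>f</note></doc>",
]

for rank, (tag, share) in enumerate(tag_share(corpus), start=1):
    print(rank, tag, round(share, 3))
```

Plotting share against rank on log-log axes is the usual quick check: a roughly straight, downward-sloping line suggests a power law, while a flat line suggests a uniform distribution.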

This reminds me of a classic paper by Darrell Raymond and Frank Tompa called "Hypertext and the Oxford English Dictionary" from the Communications of the ACM in 1988 or so. At Waterloo -- Tim Bray was also part of this work at the time -- they had a research program on how to handle large text data/hypertexts like the OED (in preparation for creating electronic versions), and they did a lot of very clever analyses of the dictionary, which had just been turned into SGML via conversion from the typesetting tapes. The paper includes several charts showing the distribution of (a) entry length, (b) number of tags per entry, (c) number of cross references, and so on, and either explicitly or implicitly they show tag-share in the dictionary to have the kind of distribution that Sean has in his analyses.

Rick Jelliffe has some software that does the same sort of thing; I saw it demonstrated at the GCA XML conferences in the last year or so.

But I don't buy into this data-centric vs doc-centric view of the world. It is obviously a continuum (called the "Document Type Spectrum" in the Document Engineering book I'm writing with Tim McGrath [just about done, MIT Press early 2005]). On one end are pure narrative things and on the other end are purely transactional ones: Moby Dick to invoices. In the middle are hybrid types like catalogs and reference books that have lots of structured content mixed in with narrative content.

I always use Moby Dick as the endpoint when I talk about this because its opening line is "call me XML" or something like that. :-)

-bob glushko

