Re: [xml-dev] An alternative formulation of the document-centric/data-ce
At 11:01 AM 6/3/2004 +0100, Sean McGrath wrote:
Document-centric XML:
XML in which corpora conforming to schema X exhibit power law distributions of the element types in X.
Data-centric XML:
XML in which corpora conforming to schema X exhibit uniform distributions of the element types in X.
Not perfect but useful nonetheless I think. Mixed content is missing for a start.
Anyway, please take a look at the graphs at:
http://seanmcgrath.blogspot.com/2004_05_23_seanmcgrath_archive.html#108576202776583412
I'd be very interested in seeing other people's graphs of the tag-share of their XML corpora.
This reminds me of a classic paper by Darrell Raymond and Frank Tompa called "Hypertext and the Oxford English Dictionary" from the Communications of the ACM in 1988 or so. At Waterloo -- Tim Bray was also part of this work at the time -- they had a research program on how to handle large text data/hypertexts like the OED (in preparation for creating electronic versions), and they did a lot of very clever analyses of the dictionary, which had just been turned into SGML via conversion from the typesetting tapes. The paper includes several charts showing the distribution of (a) entry length, (b) number of tags per entry, (c) number of cross references, and so on, and either explicitly or implicitly they show tag-share in the dictionary to have the kind of distribution that Sean has in his analyses.
Rick Jelliffe has some software that does the same sort of thing; I saw it demonstrated at the GCA XML conferences in the last year or so.
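For anyone who wants to compute the tag-share of their own corpus, here is a minimal Python sketch. It is not the method behind Sean's graphs or any tool mentioned above; the function names tag_share and normalized_entropy are made up for illustration, and the entropy figure is only a crude stand-in for telling a near-uniform distribution from a skewed one, not a proper power-law fit.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_share(documents):
    """Return (tag, share) pairs over all documents, largest share first.

    `documents` is an iterable of XML strings (an assumption of this
    sketch; reading files from disk would work the same way).
    """
    counts = Counter()
    for text in documents:
        root = ET.fromstring(text)
        # root.iter() visits the root and every descendant element.
        counts.update(el.tag for el in root.iter())
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]


def normalized_entropy(shares):
    """Shannon entropy of the shares divided by its maximum, log(k).

    1.0 means every element type is equally common; values well below
    1.0 mean a few element types dominate the corpus.
    """
    probs = [p for _, p in shares]
    if len(probs) < 2:
        return 1.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


if __name__ == "__main__":
    # Two tiny hand-made documents, one narrative-like, one invoice-like.
    narrative = "<doc><p>a</p><p>b</p><p>c</p><p>d</p><note>x</note></doc>"
    invoice = "<invoice><id>1</id><date>d</date><total>t</total></invoice>"
    for name, doc in (("narrative", narrative), ("invoice", invoice)):
        shares = tag_share([doc])
        print(name, shares, round(normalized_entropy(shares), 3))
```

Plotting the shares against their rank on log-log axes gives a graph comparable to the ones at the URL above; a roughly straight line suggests a power law, while a flat one suggests a uniform distribution.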
But I don't buy into this data-centric vs. doc-centric view of the world. It is obviously a continuum (called the "Document Type Spectrum" in the Document Engineering book I'm writing with Tim McGrath [just about done, MIT Press early 2005]). On one end are pure narrative things and on the other end are purely transactional ones: Moby Dick to invoices. In the middle are hybrid types like catalogs and reference books that have lots of structured content mixed in with narrative content.
I always use Moby Dick as the endpoint when I talk about this because its opening line is "call me XML" or something like that. :-)
-bob glushko