The October 2005 issue [2] of ACM Queue [1] is dedicated to the topic
of semi-structured data, and has several excellent articles on XML.
Here's an excerpt from one:
"XML and Semi-Structured Data"
By C. M. Sperberg-McQueen (World Wide Web Consortium)
In: ACM Queue Volume 3, Number 8 (October 2005), pages 34-41
Special Issue on Semi-Structured Data
Excerpts:
XML makes several contributions to solving the problem of
semi-structured data, the term database theorists use to denote
data that exhibits any of the following characteristics: (1)
Numerous repeating fields and structures in a naive hierarchical
representation of the data, which lead to large numbers of tables
in a second- or third-normal form representation; (2) Wide
variation in structure; (3) Sparse tables. XML provides a
natural representation for hierarchical structures and repeating
fields or structures.
Further, XML document type definitions (DTDs) and schemas allow
fine-grained control over how much variation to allow in the data:
Vocabulary designers can require XML data to be perfectly regular,
or they can allow a little variation, or a lot. In the extreme
case, an XML vocabulary can effectively say that there are no
rules at all beyond those required of all well-formed XML. Because
XML syntax records only what is present, not everything that might
be present, sparse data does not make the XML representation awkward;
XML storage systems are typically built to handle sparse data
gracefully. The most important contribution XML makes to the
problem of semi-structured data, however, is to call into question
the nature and existence of the problem. As the description makes
clear, semi-structured data is just data that does not fit neatly
into the relational model. Referring to 'the problem of semi-
structured data' suggests subliminally that the problem lies in
the failure of the data to live up fully to the relational model,
rather than in the model and its failure fully to support the
natural structure of the data.
XML invites us to model the structure of our information with
elements that form a tree structure, attributes that decorate the
nodes of the tree, and inter-nodal links that allow us to model
arbitrary graphs, not just trees. For this tree structure, XML
provides a straightforward linear representation in the form of
a labeled bracketing, which can be used for serial transfer of
information. Fundamentally, XML is simply a labeled bracketing in
which every element is labeled both at its beginning, with a
start-tag, and at its end, with an end-tag.
XML invites us to model information as a tree, but it need not be
processed in that form. XML can be understood, and processed, at
several different levels of abstraction:
* As a character stream (this is the layer actually defined by
the XML spec itself)
* As a sequence of data characters interspersed with markup
(a regular language)
* As a tree in the obvious way, with one node per element, and
the attributes as decorations on the nodes
* As a graph in which internodal links are defined by
parent-child relations between XML elements, by ID/IDREF links,
or by application-specific methods of linking between elements
* As a tree or graph annotated with information about data types
and validity (as the output of schema validation)
* As an instance of an application data structure, with
arbitrary structure, built on the basis of the XML input.
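
The levels listed above can be illustrated with a small sketch. It is not from the article: the sample document and its "id" and "target" attributes are invented, and the link is resolved by plain attribute matching rather than by DTD-declared ID/IDREF types. It uses only Python's standard xml.parsers.expat and xml.etree.ElementTree modules.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

# Hypothetical sample document; element and attribute names are invented.
DOC = '<book><chapter id="c1">Trees</chapter><ref target="c1">see chapter 1</ref></book>'

# Level 1: a character stream.
chars = list(DOC)

# Level 2: data characters interspersed with markup, seen as parser events.
events = []
parser = expat.ParserCreate()
parser.buffer_text = True  # deliver each run of character data in one piece
parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
parser.EndElementHandler = lambda name: events.append(("end", name))
parser.CharacterDataHandler = lambda text: events.append(("text", text))
parser.Parse(DOC, True)

# Level 3: a tree with one node per element, attributes as decorations.
root = ET.fromstring(DOC)

# Level 4: a graph. Parent-child edges, plus links found by matching the
# invented "target" attribute against "id" (an ID/IDREF-style link).
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
edges = [(parent.tag, child.tag, "child")
         for parent in root.iter() for child in parent]
edges += [(el.tag, by_id[el.get("target")].tag, "link")
          for el in root.iter() if el.get("target") in by_id]
```

The same input yields a list of characters, a list of events, a tree, and an edge list; which one an application uses depends on the level of abstraction it needs.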
[...]
By offering tree structures, instead of just lines of characters
or tabular structures, XML dramatically enriches the possibilities
for representation of documents and other information. Many kinds
of information, documents among them, have prominent hierarchical
organization and their representation using XML is dramatically
more natural and convenient than using competing notations. But
the hard fact is that in many kinds of interesting data, hierarchical
structures coexist with other, competing hierarchical structures,
or with information that resists any kind of hierarchy.
To take a simple example: A book typically has a hierarchical
logical structure of front matter, body, and back matter, with
the body being subdivided into chapters, sections, subsections,
and so on; but books also have a physical structure of volume,
gathering, opening, page, column, line. Whenever paragraphs flow
across page boundaries -- that is, virtually always -- these two
hierarchies come into conflict. This topic has been of interest
to markup theorists for at least 20 years, and new proposals
continue to appear: Concurrent markup hierarchies, colored XML,
GODDAG (general ordered-descendant directed acyclic graph)
structures, just-in-time trees, LMNL (Layered Markup Annotation
Language), and range algebras are just a few of the more
interesting recent proposals.
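
A minimal sketch of why the two hierarchies conflict, assuming each structural unit is recorded as a range of character offsets. The offsets and names are hypothetical, and this is not any of the proposals named above; it only detects the overlaps that prevent a single tree.

```python
# Hypothetical character offsets (start, end) for two views of the same text.
LOGICAL = {"para1": (0, 120), "para2": (120, 300)}
PHYSICAL = {"page1": (0, 200), "page2": (200, 300)}

def nests(a, b):
    """True if the two ranges are disjoint or one contains the other."""
    (a0, a1), (b0, b1) = a, b
    disjoint = a1 <= b0 or b1 <= a0
    a_contains_b = a0 <= b0 and b1 <= a1
    b_contains_a = b0 <= a0 and a1 <= b1
    return disjoint or a_contains_b or b_contains_a

def conflicts(logical, physical):
    """List pairs of ranges that cannot both be elements of one tree."""
    return [(ln, pn) for ln, lr in logical.items()
            for pn, pr in physical.items() if not nests(lr, pr)]

overlapping = conflicts(LOGICAL, PHYSICAL)
```

Here the second paragraph starts on page 1 and ends on page 2, so the pair ("para2", "page1") is reported: neither range contains the other, and a single XML tree cannot hold both as elements.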
XML has inherited from formal language theory as defined by Noam
Chomsky in 1957 the notion that a language is a Boolean set of
strings. Applied to documents and document grammars, this means
that documents are either valid and members of the set or else
invalid and not members. In reality, some errors are more severe
than others, and our systems would be less rigid and brittle if
our notion of validity allowed continuous ('fuzzy') values instead
of forcing a black/white distinction. The rigidity of the
distinction is one reason that some XML users prefer not to use
document grammars. A more flexible notion of validity would make
writing flexible applications possible without giving in to
dirty data.
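
One way to picture a continuous notion of validity is a score between 0 and 1 instead of a yes/no answer. The sketch below is an assumption-laden illustration, not the article's proposal: the grammar table and the scoring rule (the fraction of elements that are known and contain only permitted children) are invented, and the rule does not weight errors by severity.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: allowed child element names for each parent.
ALLOWED = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": set(),
}

def validity_score(xml_text):
    """Return the fraction of elements that are known and have only allowed children."""
    elements = list(ET.fromstring(xml_text).iter())
    ok = sum(1 for el in elements
             if el.tag in ALLOWED
             and all(child.tag in ALLOWED[el.tag] for child in el))
    return ok / len(elements)

clean = "<book><title>T</title><chapter><para>p</para></chapter></book>"
dirty = ("<book><title>T</title><chapter><para>p</para>"
         "<note>n</note></chapter></book>")

clean_score = validity_score(clean)
dirty_score = validity_score(dirty)
```

The document with the stray note element scores lower than the clean one but is not rejected outright, so an application could set its own threshold.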
Given the massive proliferation of schemaless XML vocabularies,
the need for tools to support grammar induction is increasing:
Given a body of XML data, what grammars can be written that
describe the data? There are several more or less widely known
efforts in this area, from the attempt to generate a grammar for
the New Oxford English Dictionary in the late 1980s to the
industrially oriented grammar induction of the Fred project at
OCLC (Online Computer Library Center).
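
As a rough illustration of the grammar-induction question, the sketch below builds the most specific grammar possible: for each element name, the choice among the child sequences actually seen in the data. The sample documents, function names, and output notation are invented for illustration. Real induction tools generalize further (for example, collapsing repeated children into a repetition), which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def induce_content_models(documents):
    """Collect, for each element name, the child sequences observed in the data."""
    observed = defaultdict(set)
    for text in documents:
        for el in ET.fromstring(text).iter():
            observed[el.tag].add(tuple(child.tag for child in el))
    return observed

def to_rules(observed):
    """Render each element as a choice among its observed child sequences."""
    rules = []
    for name in sorted(observed):
        parts = ["(" + ", ".join(seq) + ")" if seq else "EMPTY-OR-TEXT"
                 for seq in sorted(observed[name])]
        rules.append(name + " ::= " + " | ".join(parts))
    return rules

# Hypothetical dictionary-like sample data.
DOCS = [
    "<entry><hw>a</hw><def>x</def></entry>",
    "<entry><hw>b</hw><def>y</def><def>z</def></entry>",
]
rules = to_rules(induce_content_models(DOCS))
```

For this sample the rule for entry lists two alternatives, (hw, def) and (hw, def, def). A human or a smarter tool would likely generalize that to one hw followed by one or more def, which is one of many grammars that describe the same data.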
Schemaless or not, the number of XML vocabularies is exploding
and unlikely to shrink anytime soon. Both in the context of data
integration projects that provide searching over a federation of
data sources, and in the context of a single project working
with an evolving document grammar, applications of the data-
exchange problem to XML are important. Given two schemas
S1 and S2, allow the convenient specification of a mapping from
S1 to S2 or find such a mapping automatically. Given that mapping
and a query against schema S2, translate the query into terms of
schema S1 to allow the data to be filtered without first being
materialized in schema S2.
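
The query-translation step can be sketched under a strong simplifying assumption: that the mapping from S1 to S2 is a one-to-one renaming of element names, and that queries are simple slash-separated paths. The schemas, names, and data below are hypothetical; real mappings are usually structural and much harder to invert.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from schema S1 element names to schema S2 names.
S1_TO_S2 = {"library": "catalog", "book": "item", "name": "title"}

def translate_query(s2_path, s1_to_s2):
    """Rewrite a simple slash-separated S2 path into S1 terms."""
    s2_to_s1 = {s2: s1 for s1, s2 in s1_to_s2.items()}
    return "/".join(s2_to_s1.get(step, step) for step in s2_path.split("/"))

# Data stored in schema S1; it is never converted to schema S2.
S1_DATA = ("<library><book><name>XML</name></book>"
           "<book><name>SGML</name></book></library>")

s1_path = translate_query("item/title", S1_TO_S2)
titles = [el.text for el in ET.fromstring(S1_DATA).findall(s1_path)]
```

The query written against S2 ("item/title") is rewritten to "book/name" and evaluated directly on the S1 data, so the data is filtered without first being materialized in schema S2.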
How does XML help solve the semi-structured data problem? XML
provides a tool for representing and grappling with the data and
recognizing the complexity of its inbuilt structure.
[1] http://www.acmqueue.org/
[2] http://www.acmqueue.org/modules.php?name=Content&pa=list_pages_issues&issue_id=27
--
Robin Cover