[
Lists Home |
Date Index |
Thread Index
]
Selam Aytaç,
Aytaç Oral wrote:
>Hi,
>I'm looking for resources about XML storage structure for limited internal
>memory databases which will expedite join operations. Drawing of the storage
>structure and a brief explantion are the things for my research. Any links,
>doc?
>
>
In my work on the integration of XML in the Scala language
http://scala.epfl.ch, I came to experiment with a couple of things,
which might be interesting. I will simply enumerate some possibilities
(1.a) Assume you would like to maintain a tree structure, and you can
afford copying the tree for updates, like insertion of a node, or an
attribute. Then often it is enough to do "hash-consing".
This old technique of compiler construction simply avoids to construct
duplicate trees. It is very simple to implement. In your SAX traversal,
before you create a node you compute a hashcode for each node based on
the element name, present attributes (keys and values) and the hashcodes
of the children. Then you look up whether a node with the same hashcode
was created, and if yes, if it contains really the same information. If
yes, you reuse that one instead of creating a node. The result is a
directed acyclic graph where no information is duplicated.
Highly repetitive, real life documents, like DNA or address databases
can shrink considerably, and for things like transformation, you never
need to update the original tree. Consider things like <year>2003</year>
or <city>Lausanne</city>
(1.b) This technique can be refined.
Cut away all leaves (PCDATA nodes) of the tree and store them
separately, together with the path information where they occurred. Then
do hash-consing on the "naked" tree (ignoring leaves). As a consequence,
more information is shared.
However, you now have to carry along path information when you navigate
along the tree. When you access a leaf, you use it to retrieve it from
the table where the leaves are stored. It is really enough to store a
path as a sequence of numbers ("Dewey-notation") like 1.4.2.3.
(2) If you insist on relational tables and SQL, and you do not need to
preserve document order, this case is dealt with e.g. in pp.172 of
Serge Abiteboul, Peter Buneman, Dan Suciu. "Data on the Web". (2000)
Summing up, without a schema (here used for type,dtd,rng,...) all you
can do is store the transitions of the document tree. Every edge has its
own row. Navigation, like XPath queries, can be translated to SQL
queries, but they are expensive because of the joins.
However, if you know the set of documents in advance, you can extract a
schema (an "index"), which is also described there.
A schema is a good thing because you can use the type information for
inlining (storing attributes and children nodes in the same table row).
This reduces the number of joins, as you mentioned. A good reference for
inlining techniques is described in
Javayel et al. ***Relational** Databases for Querying XML Documents:
Limitations and Opportunities (1999)*
http://citeseer.nj.nec.com/295071.html
(3) My colleague Seb Maneth has a data compression algorithm that seems
to outperform the method in (1.b) while the resulting compressed tree
means fully accessible. This is ongoing work, there is an implementation
in C and I think he will have a paper out soon.
(4) If you use a data-binding tool, then you can deal with storage of
objects instead of storage of XML.
Scala has a data binding tool which reads in a DTD and then maps element
types to object types, together with the necessary parsing machinery.
Numerous others exist... This is really only useful for in-memory
representation, and if you work in a statically typed language,
preferably object-oriented. For persistency one needs again the
techniques described above.
A technical comment to advertise my research cause. (You may safely skip
if you can afford untyped XML processing:-)
Since most type systems do not offer regular types, some type
information gets lost during data binding. The only exceptions I am
aware of are the research efforts Xduce and FX (Haruo Hosoya, Jerome
Vouillon, Benjamin Pierce), Microsoft XEN (Erik Meijer, Wolfram
Schulte) and soon our own language.
Not a data binding tool, but I think only Jerome Simeon and Mary
Fernandez' Galax implementation of XQuery offers some static typing, and
Mary's ECOOP 2003 talk picked up the very same topic of mapping the data
model to more efficient storage structures as known for tables and SQL.
Hope this helps.
cheers,
Burak
|