xml-dev - Re: [xml-dev] XML storage structure

Re: [xml-dev] XML storage structure

[ Lists Home | Date Index | Thread Index ]

To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] XML storage structure
From: Burak Emir <Burak.Emir@epfl.ch>
Date: Wed, 19 Nov 2003 10:32:52 +0100
Cc: Aytaç Oral <aytor@kocbank.com.tr>
In-reply-to: <014924A34F04D6119A120002A54115D1116B8A@SISTEK>
References: <014924A34F04D6119A120002A54115D1116B8A@SISTEK>
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624

Selam Aytaç,

Aytaç Oral wrote:

>Hi,
>I'm looking for resources about XML storage structure for limited internal
>memory databases which will expedite join operations. Drawing of the storage
>structure and a brief explantion are the things for my research. Any links,
>doc? 
>  
>
In my work on the integration of XML in the Scala language 
http://scala.epfl.ch, I came to experiment with a couple of things, 
which might be interesting. I will simply enumerate some possibilities

(1.a) Assume you would like to maintain a tree structure, and you can 
afford copying the tree for updates, like insertion of a node, or an 
attribute. Then often it is enough to do "hash-consing".

This old technique of compiler construction simply avoids to construct 
duplicate trees. It is very simple to implement. In your SAX traversal, 
before you create a node you compute a hashcode for each node based on 
the element name, present attributes (keys and values) and the hashcodes 
of the children. Then you look up whether a node with the same hashcode 
was created, and if yes, if it contains really the same information. If 
yes, you reuse that one instead of creating a node. The result is a 
directed acyclic graph where no information is duplicated.

Highly repetitive, real life documents, like DNA or address databases 
can shrink considerably, and for things like transformation, you never 
need to update the original tree. Consider things like <year>2003</year> 
or <city>Lausanne</city>

(1.b) This technique can be refined.

Cut away all leaves (PCDATA nodes) of the tree and store them 
separately, together with the path information where they occurred. Then 
do hash-consing on the "naked" tree (ignoring leaves). As a consequence, 
more information is shared.

However, you now have to carry along path information when you navigate 
along the tree. When you access a leaf, you use it to retrieve it from 
the table where the leaves are stored. It is really enough to store a 
path as a sequence of numbers ("Dewey-notation") like 1.4.2.3.

(2) If you insist on relational tables and SQL, and you do not need to 
preserve document order, this case is dealt with e.g. in pp.172 of

Serge Abiteboul, Peter Buneman, Dan Suciu. "Data on the Web". (2000)

Summing up, without a schema (here used for type,dtd,rng,...) all you 
can do is store the transitions of the document tree. Every edge has its 
own row. Navigation, like XPath queries, can be translated to SQL 
queries, but they are expensive because of the joins.

However, if you know the set of documents in advance, you can extract a 
schema (an "index"), which is also described there.

A schema is a good thing because you can use the type information for 
inlining (storing attributes and children nodes in the same table row). 
This reduces the number of joins, as you mentioned. A good reference for 
inlining techniques is described in

Javayel et al. ***Relational** Databases for Querying XML Documents: 
Limitations and Opportunities (1999)*
http://citeseer.nj.nec.com/295071.html

(3) My colleague Seb Maneth has a data compression algorithm that seems 
to outperform the method in (1.b) while the resulting compressed tree 
means fully accessible. This is ongoing work, there is an implementation 
in C and I think he will have a paper out soon.

(4) If you use a data-binding tool, then you can deal with storage of 
objects instead of storage of XML.

Scala has a data binding tool which reads in a DTD and then maps element 
types to object types, together with the necessary parsing machinery. 
Numerous others exist... This is really only useful for in-memory 
representation, and if you work in a statically typed language, 
preferably object-oriented. For persistency one needs again the 
techniques described above.

A technical comment to advertise my research cause. (You may safely skip 
if you can afford untyped XML processing:-)

Since most type systems do not offer regular types, some type 
information gets lost during data binding. The only exceptions I am 
aware of are the research efforts Xduce and FX (Haruo Hosoya, Jerome 
Vouillon, Benjamin Pierce),  Microsoft XEN (Erik Meijer, Wolfram 
Schulte) and soon our own language.

Not a data binding tool, but I think only Jerome Simeon and Mary 
Fernandez' Galax implementation of XQuery offers some static typing, and 
Mary's ECOOP 2003 talk picked up the very same topic of mapping the data 
model to more efficient storage structures as known for tables and SQL.

Hope this helps.

cheers,
Burak

References:
- XML storage structure
  - From: Aytaç Oral <aytor@kocbank.com.tr>

Prev by Date: Fat clients/thin data (Re: [xml-dev] RE: Why can't we all work together?XML UI Languages Abound)
Next by Date: RE: Blog Pruning
Previous by thread: XML storage structure
Next by thread: Call for Papers: DATAX Workshop 2004
Index(es):
- Date
- Thread