- To: "Bill Lindsey" <firstname.lastname@example.org>
- Subject: RE: [xml-dev] Indexing solution for native XML database
- From: "Michael Rys" <email@example.com>
- Date: Wed, 30 Nov 2005 16:16:50 -0800
- Cc: <firstname.lastname@example.org>
- Thread-index: AcX14BGhz1gFtR6sSUed8zchXjy+4AALA+BA
- Thread-topic: [xml-dev] Indexing solution for native XML database
Thanks for chiming in. Your approach looks somewhat similar to the
primary XML index that we use, except that our node ID encoding
(ORDPATHs, see the SIGMOD 2004 paper) handles updates much better (they
can be done locally and scale well), and we can use our cost-based
optimizer to choose the best plan instead of having to hard-code it.
> -----Original Message-----
> From: Bill Lindsey [mailto:email@example.com]
> Sent: Wednesday, November 30, 2005 10:58 AM
> To: Michael Rys
> Cc: firstname.lastname@example.org
> Subject: Re: [xml-dev] Indexing solution for native XML database
> Michael Rys wrote:
> > I know of at least one that was doing ok: B-Bop. They got bought out
> > when they were not growing as fast as their VCs were hoping...
> Some 6 years ago, I was working at B-Bop. When the founders decided
> we must implement an XML database on top of a COTS RDBMS, I told them
> that it had been tried before with SGML, and that the cost of self-joins
> would be too high. Nevertheless, they insisted.
> We devised a technique of encoding name and positional context in
> "tumbler" strings that allowed us to get around the self-join
> performance hit and leverage the RDBMS's indexes.
> I'll try to give a simplified description here:
> Most of the interesting information was in two tables:
>
> XML Names
>   nameID           int
>   name             varchar
>
> LeafNodes
>   docID            int
>   nodeID           int
>   positionPath     varchar
>   namePath         varchar
>   reverseNamePath  varchar
>   charValue        varchar
> positionPath, namePath and reverseNamePath are the "tumbler"
> strings. They represent a sequence of numbers as the concatenation
> of fixed length strings. For instance, if I want to indicate
> that a node is the third child of the fourth child of the document
> element, I might represent that as the sequence 0, 4, 3, or
> in the "tumbler" notation in positionPath as "_0._4._3".
> Attributes are always assigned the sequence number "0".
> Likewise, we can assign all XML names in the database to numbers,
> and encode the names of ancestor elements.
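
A minimal sketch of the tumbler encoding described above, in Python. The
function names and the `width` parameter are hypothetical and not from the
original system; width 1 reproduces the simplified "_0._4._3" notation, and
a larger width zero-pads each number so that every component has the same
length.

```python
def tumbler(numbers, width=1):
    """Encode a sequence of non-negative ints as a tumbler string.

    Each number becomes "_" plus its zero-padded decimal form, and the
    components are joined with ".". The exact format is an assumption
    based on the simplified examples in the email.
    """
    return ".".join("_" + str(n).zfill(width) for n in numbers)


def reverse_tumbler(numbers, width=1):
    """Encode the same sequence leaf-first, as in reverseNamePath."""
    return tumbler(list(reversed(numbers)), width)


# Third child of the fourth child of the document element.
position_path = tumbler([0, 4, 3])

# Name IDs for foo/bar/baz/@blort, root-first and leaf-first.
name_path = tumbler([1, 2, 3, 4])
reverse_name_path = reverse_tumbler([1, 2, 3, 4])

print(position_path, name_path, reverse_name_path)
```

Padding matters once a number needs more than one digit: without it,
comparing the strings no longer agrees with comparing the numbers.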
> Given an XML document:
> <foo><bar> something <baz blort='yadda'> new </baz></bar></foo>
> The tables would look like:
> XML Names
> nameID  name
> ------  -------
> 0       #PCDATA
> 1       foo
> 2       bar
> 3       baz
> 4       blort
>
> LeafNodes
> docID  nodeID  positionPath  namePath     reverseNamePath  charValue
> -----  ------  ------------  -----------  ---------------  -------------
> 1      1       _0._1._1      _1._2._0     _0._2._1         ' something '
> 1      2       _0._1._2._0   _1._2._3._4  _4._3._2._1      'yadda'
> 1      3       _0._1._2._1   _1._2._3._0  _0._3._2._1      ' new '
> This lets us use the RDBMS's indexing facilities to quickly
> retrieve branches of documents with a given ancestry.
> /foo/bar//node() -->
> select * from LeafNodes where namePath like '_1._2%'
> order by positionPath
> Finding all values of the attribute 'blort' in a 'baz' element would be:
> //baz/@blort -->
> select * from LeafNodes where reverseNamePath like '_4._3%'
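
The two queries above can be tried against the sample rows with a small
Python script using the standard sqlite3 module. This is only a sketch:
SQLite stands in for the unnamed RDBMS, the index names are made up, and
charValue is stored without the surrounding quotes. Note that "_" is itself
a single-character wildcard in SQL LIKE, so these patterns match somewhat
more loosely than a literal prefix would; on this data the results are the
same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE LeafNodes (
           docID INTEGER, nodeID INTEGER,
           positionPath TEXT, namePath TEXT,
           reverseNamePath TEXT, charValue TEXT)"""
)
# Hypothetical index names; whether an engine uses them for LIKE
# prefix matching depends on the engine and its settings.
conn.execute("CREATE INDEX idx_name_path ON LeafNodes (namePath)")
conn.execute("CREATE INDEX idx_rev_name_path ON LeafNodes (reverseNamePath)")

# The three leaf rows from the example document.
rows = [
    (1, 1, "_0._1._1", "_1._2._0", "_0._2._1", " something "),
    (1, 2, "_0._1._2._0", "_1._2._3._4", "_4._3._2._1", "yadda"),
    (1, 3, "_0._1._2._1", "_1._2._3._0", "_0._3._2._1", " new "),
]
conn.executemany("INSERT INTO LeafNodes VALUES (?, ?, ?, ?, ?, ?)", rows)

# /foo/bar//node(): every leaf whose name path starts with foo, bar.
descendants = [
    node_id
    for (node_id,) in conn.execute(
        "SELECT nodeID FROM LeafNodes "
        "WHERE namePath LIKE '_1._2%' ORDER BY positionPath"
    )
]

# //baz/@blort: leaf-first name path starts with blort, baz.
blort_values = [
    value
    for (value,) in conn.execute(
        "SELECT charValue FROM LeafNodes "
        "WHERE reverseNamePath LIKE '_4._3%'"
    )
]

print(descendants, blort_values)
```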
> The resulting rows from these queries contain enough information to
> allow some procedural code to fire off a sequence of SAX start/end
> element events or build further queries to bring back related portions
> of the trees.
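
One way such procedural code could work is sketched below in Python. It is
not the B-Bop implementation: the event tuples are a simplified stand-in
for real SAX callbacks, the function names are hypothetical, and it assumes
rows arrive ordered by positionPath, that nameID 0 marks text, and that any
other leaf is an attribute. Elements with no text or attribute descendants
produce no rows and so would not appear.

```python
def parse_tumbler(text):
    """Turn "_0._1._2" back into [0, 1, 2]."""
    return [int(part.lstrip("_")) for part in text.split(".")]


def leaf_rows_to_events(rows, names):
    """Rebuild start/end events from (positionPath, namePath, charValue) rows.

    `names` maps nameID to element or attribute name.
    """
    events = []
    open_elems = []  # (position, nameID) for each currently open element
    for position_path, name_path, value in rows:
        pos = parse_tumbler(position_path)
        ids = parse_tumbler(name_path)
        ancestors = list(zip(pos[:-1], ids[:-1]))
        # Count how many open elements are shared with this leaf's ancestry.
        keep = 0
        while (keep < len(open_elems) and keep < len(ancestors)
               and open_elems[keep] == ancestors[keep]):
            keep += 1
        # Close elements this leaf is not inside, innermost first.
        while len(open_elems) > keep:
            events.append(("end", names[open_elems.pop()[1]]))
        # Open the ancestors that are not open yet.
        for anc in ancestors[keep:]:
            open_elems.append(anc)
            events.append(("start", names[anc[1]]))
        # Emit the leaf itself.
        if ids[-1] == 0:
            events.append(("text", value))
        else:
            events.append(("attr", names[ids[-1]], value))
    # Close whatever is still open at the end of the result set.
    while open_elems:
        events.append(("end", names[open_elems.pop()[1]]))
    return events


names = {0: "#PCDATA", 1: "foo", 2: "bar", 3: "baz", 4: "blort"}
rows = [
    ("_0._1._1", "_1._2._0", " something "),
    ("_0._1._2._0", "_1._2._3._4", "yadda"),
    ("_0._1._2._1", "_1._2._3._0", " new "),
]
events = leaf_rows_to_events(rows, names)
for event in events:
    print(event)
```

A real SAX startElement call carries its attributes, so an actual
implementation would need to buffer the attribute rows that follow an
element before firing the event.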
> While many queries had O(log n) performance, space usage was high,
> inserts were expensive and updates generally meant removing and
> re-inserting the entire document.
> Bill Lindsey http://www.blnz.com