   RE: [xml-dev] Indexing solution for native XML database

  • To: "Bill Lindsey" <bill@blnz.com>
  • Subject: RE: [xml-dev] Indexing solution for native XML database
  • From: "Michael Rys" <mrys@microsoft.com>
  • Date: Wed, 30 Nov 2005 16:16:50 -0800
  • Cc: <xml-dev@lists.xml.org>
  • Thread-index: AcX14BGhz1gFtR6sSUed8zchXjy+4AALA+BA
  • Thread-topic: [xml-dev] Indexing solution for native XML database

Hi Bill

Thanks for chiming in. Your approach looks somewhat similar to the
primary XML index that we use, except that our node ID encoding
(ORDPATHs, see the SIGMOD 2004 paper) is much better w.r.t. updates (they
can be done locally and scale as well), and we can use our cost-based
optimizer to choose the best plan instead of having to hard-code it.

Best regards
Michael

> -----Original Message-----
> From: Bill Lindsey [mailto:bill@blnz.com]
> Sent: Wednesday, November 30, 2005 10:58 AM
> To: Michael Rys
> Cc: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] Indexing solution for native XML database
> 
> Michael Rys wrote:
> > I know of at least one that was doing ok: B-Bop. They got bought out
> > when they were not growing as fast as their VCs were hoping...
> 
> Some 6 years ago, I was working at B-Bop when the founders decided
> we must implement an XML database on top of a COTS RDBMS. I told them
> that it had been tried before with SGML, and that the cost of
> self-joins would be too high.  Nevertheless, they insisted.
> 
> We devised a technique of encoding name and positional context in
> "tumbler" strings that allowed us to get around the self join
> performance hit, and leverage the RDBMS's indexes.
> 
> I'll try to give a simplified description, here:
> 
> Most of the interesting information was in two tables:
> 
> XMLNames:
>    nameID           int
>    name             varchar
> 
> LeafNodes:
>    docID            int
>    nodeID           int
>    positionPath     varchar
>    namePath         varchar
>    reverseNamePath  varchar
>    charValue        varchar
> 
> positionPath, namePath and reverseNamePath are the "tumbler"
> strings.  They represent a sequence of numbers as the concatenation
> of fixed length strings.  For instance, if I want to indicate
> that a node is the third child of the fourth child of the document
> element,  I might represent that as the sequence 0, 4, 3, or
> in the "tumbler" notation in positionPath as "_0._4._3".
> Attributes are always assigned the sequence number "0".
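
A minimal sketch of the tumbler encoding described above, in Python. The
name encode_tumbler and the width parameter are illustrative and not from
the original post; zero-padding each number to a fixed width is an
assumption about how the "fixed length strings" were produced.

```python
def encode_tumbler(numbers, width=1):
    """Encode a sequence of non-negative ints as a tumbler string.

    Each component is '_' plus the number zero-padded to `width` digits
    (the padding scheme is an assumption); components are joined by '.'.
    """
    parts = []
    for n in numbers:
        digits = str(n).zfill(width)
        if n < 0 or len(digits) > width:
            raise ValueError(f"{n} does not fit in {width} digit(s)")
        parts.append("_" + digits)
    return ".".join(parts)


# Third child of the fourth child of the document element.
print(encode_tumbler([0, 4, 3]))            # _0._4._3
print(encode_tumbler([0, 4, 3], width=3))   # _000._004._003
```

With a fixed width, plain string comparison agrees with document order
(for example "_00._02" sorts before "_00._10"), which is what a query
needs in order to sort on positionPath.
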
> 
> Likewise, we can assign all XML names in the database to numbers,
> and encode the names of ancestor elements.
> 
> Given an XML document:
> 
>    <foo><bar> something <baz blort='yadda'> new </baz></bar></foo>
> 
> The tables would look like:
> 
> XMLNames
>   nameID  name
>   ------  -----
>   0       #PCDATA
>   1       foo
>   2       bar
>   3       baz
>   4       blort
> 
> LeafNodes
>   docID nodeID positionPath  namePath     reverseNamePath  charValue
>   ----- -----  ------------  ----------   ---------------  ----------
>   1     1      _0._1._1      _1._2._0     _0._2._1         ' something '
>   1     2      _0._1._2._0   _1._2._3._4  _4._3._2._1      'yadda'
>   1     3      _0._1._2._1   _1._2._3._0  _0._3._2._1      ' new '
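
One way the rows above could be derived from the sample document, as a
rough Python sketch. The functions shred and tumbler are hypothetical
names, and the numbering rules (document element at 0, children counted
from 1, name IDs handed out in first-seen order) are inferred from the
example table rather than stated in the post.

```python
import xml.etree.ElementTree as ET


def tumbler(numbers):
    # Single-digit components, matching the simplified notation in the post.
    return ".".join(f"_{n}" for n in numbers)


def shred(elem, pos, name_path, names, rows):
    """Append one (positionPath, namePath, reverseNamePath, charValue) per leaf."""
    name_path = name_path + [names.setdefault(elem.tag, len(names))]

    def add_row(position, name_id, value):
        full = name_path + [name_id]
        rows.append((tumbler(pos + [position]), tumbler(full),
                     tumbler(reversed(full)), value))

    for attr, value in elem.attrib.items():
        # Attributes always get sequence number 0.
        add_row(0, names.setdefault(attr, len(names)), value)

    child_pos = 0
    if elem.text:  # text before the first child element
        child_pos += 1
        add_row(child_pos, names["#PCDATA"], elem.text)
    for child in elem:
        child_pos += 1
        shred(child, pos + [child_pos], name_path, names, rows)
        if child.tail:  # text that follows this child element
            child_pos += 1
            add_row(child_pos, names["#PCDATA"], child.tail)


names = {"#PCDATA": 0}
rows = []
doc = ET.fromstring(
    "<foo><bar> something <baz blort='yadda'> new </baz></bar></foo>")
shred(doc, [0], [], names, rows)
for node_id, row in enumerate(rows, start=1):
    print(1, node_id, *row)
```
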
> 
> This lets us use the RDBMS's indexing facilities to quickly
> retrieve branches of documents with a given ancestry.
> 
> /foo/bar//node() -->
>    select * from LeafNodes where namePath like '_1._2%'
>      order by positionPath
> 
> Finding all values of the attribute 'blort' in a 'baz' element would be:
> 
> //baz/@blort  -->
>    select * from LeafNodes where reverseNamePath like '_4._3%'
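
The two queries can be tried against the sample rows with Python's
sqlite3 module, as in the sketch below. SQLite stands in for whatever
RDBMS was actually used, and the helpers like_prefix and node_ids are
hypothetical. One caveat that is not in the original post: in SQL LIKE
patterns "_" is itself a single-character wildcard, so this sketch
escapes it to make the prefix match literal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE LeafNodes (
    docID INTEGER, nodeID INTEGER, positionPath TEXT,
    namePath TEXT, reverseNamePath TEXT, charValue TEXT)""")
conn.executemany("INSERT INTO LeafNodes VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, "_0._1._1", "_1._2._0", "_0._2._1", " something "),
    (1, 2, "_0._1._2._0", "_1._2._3._4", "_4._3._2._1", "yadda"),
    (1, 3, "_0._1._2._1", "_1._2._3._0", "_0._3._2._1", " new "),
])


def like_prefix(tumbler):
    # Escape '_' so LIKE treats it literally, then match any suffix.
    return tumbler.replace("_", "\\_") + "%"


def node_ids(column, tumbler, order_by="nodeID"):
    # Column names come from the fixed schema above, never from user input.
    sql = (f"SELECT nodeID FROM LeafNodes WHERE {column} LIKE ? ESCAPE '\\' "
           f"ORDER BY {order_by}")
    return [row[0] for row in conn.execute(sql, (like_prefix(tumbler),))]


# /foo/bar//node()
descendants = node_ids("namePath", "_1._2", order_by="positionPath")
# //baz/@blort
blort_attrs = node_ids("reverseNamePath", "_4._3")
print(descendants, blort_attrs)
```

For this data the unescaped patterns from the post return the same rows;
the escaping only matters when some other character could appear where
the "_" is.
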
> 
> The resulting rows from these queries contain enough information to
> allow some procedural code to fire off a sequence of SAX start/end
> element events or build further queries to bring back related portions
> of the trees.
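
As an illustration of that procedural step, the sketch below turns rows
sorted by positionPath into a flat stream of start/end events. It is a
guess at the general idea, not the original code: the function names are
hypothetical, it assumes every row is a text node or an attribute, and it
reports attributes as separate events, whereas real SAX delivers them
with the start-element callback.

```python
def parse_tumbler(s):
    # "_0._1._2" -> [0, 1, 2]
    return [int(part[1:]) for part in s.split(".")]


def events(rows, names):
    """Yield events for (positionPath, namePath, charValue) rows.

    `rows` must already be sorted by positionPath; `names` maps
    nameID -> name.
    """
    open_elems = []  # stack of (position prefix, element name)
    for pos_s, name_s, value in rows:
        pos, name_ids = parse_tumbler(pos_s), parse_tumbler(name_s)
        # Every proper prefix of the leaf's path is an ancestor element.
        ancestors = [(tuple(pos[:i]), names[name_ids[i - 1]])
                     for i in range(1, len(pos))]
        # Close open elements that are not ancestors of this leaf.
        while open_elems and (len(open_elems) > len(ancestors) or
                              open_elems[-1] != ancestors[len(open_elems) - 1]):
            yield ("end", open_elems.pop()[1])
        # Open the ancestors that are not on the stack yet.
        for anc in ancestors[len(open_elems):]:
            open_elems.append(anc)
            yield ("start", anc[1])
        if pos[-1] == 0:  # attributes always sit at position 0
            yield ("attr", names[name_ids[-1]], value)
        else:
            yield ("chars", value)
    while open_elems:
        yield ("end", open_elems.pop()[1])


names = {0: "#PCDATA", 1: "foo", 2: "bar", 3: "baz", 4: "blort"}
rows = [
    ("_0._1._1", "_1._2._0", " something "),
    ("_0._1._2._0", "_1._2._3._4", "yadda"),
    ("_0._1._2._1", "_1._2._3._0", " new "),
]
for event in events(rows, names):
    print(event)
```
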
> 
> While many queries had O(log n) performance, space usage was high,
> inserts were expensive and updates generally meant removing and
> re-inserting the entire document.
> 
> --
> Bill Lindsey        http://www.blnz.com





 
