OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

 


 

   RE: [xml-dev] Indexing solution for native XML database

[ Lists Home | Date Index | Thread Index ]
  • To: Timo Hildén <castaway@daug.net>,<xml-dev@lists.xml.org>
  • Subject: RE: [xml-dev] Indexing solution for native XML database
  • From: "Michael Rys" <mrys@microsoft.com>
  • Date: Wed, 30 Nov 2005 05:50:47 -0800
  • Thread-index: AcXz9ExPJJ135YGHQA63MrcUSs+0IQBwG6YA
  • Thread-topic: [xml-dev] Indexing solution for native XML database

I would recommend that you take a look at some of the VLDB and SIGMOD papers that we have published on how we implement XML and XQuery in SQL Server. You can find the links to our papers on my weblog at http://sqljunkies.com/weblog/mrys. IBM and Oracle also have papers at recent VLDBs and SIGMODs describing their approaches.

While all of us drive it deeper into the engine than what is possible for you, you may find some ideas of tackling your performance issues.

Best regards
Michael

> -----Original Message-----
> From: Timo Hildén [mailto:castaway@daug.net]
> Sent: Monday, November 28, 2005 12:16 AM
> To: xml-dev@lists.xml.org
> Subject: [xml-dev] Indexing solution for native XML database
> 
> Hello all,
> 
> I'm searching for an indexing solution for my native XML database project,
> which I'm writing as a learning project.
> 
> I use C++ as development language and a relational database as backend.
> I'm using a data model based on the XQuery 1.0 / XPath 2.0 Data Model
> (W3C), and I have tables in the relational database corresponding to the
> seven node kinds mentioned in that specification. I've implemented simple
> XPath support and everything seems to work fine so far, but performance is
> a gruesome problem. The more data I store into the database, the slower
> expression processing gets. No surprise there.
> 
> The expressions are processed using the schema and instance data in the
> underlying relational database in compilation and execution phases,
> respectively. I think I've indexed the columns in the relational database
> as optimally as possible, so no further speedups can be achieved at that
> level.
> 
> So, some sort of indexing scheme is clearly required at the data model
> level. Frankly, I'm quite unfamiliar with indexing when it comes to native
> XML databases, especially cases that use a relational database as backend,
> as mine does. I'd like to have some kind of automatic structural indexing
> at least.
> 
> I'll provide some background information about my project, simplified for
> clarity. Let's say I have the following schema:
> 
> <xs:element name="Patrol" maxOccurs="unbounded">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="Officer" maxOccurs="2">
>         <xs:complexType>
>           <xs:sequence>
>             <xs:element name="FullName" type="xs:string" />
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>       <xs:element name="Arrests" type="xs:integer" />
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
> 
> I also have several instance documents based upon it:
> 
> <Patrol>
>   <Officer>
>     <FullName>Carey Mahoney</FullName>
>   </Officer>
>   <Officer>
>     <FullName>Larvell Jones</FullName>
>   </Officer>
>   <Arrests>17</Arrests>
> </Patrol>
> 
> ... etc ...
> 
> Here is an excerpt of the SchemaNodeElement table that represents schema
> information of the elements in the data model:
> 
> Table: SchemaNodeElement
> +----+----------+----------+----------+
> | Id | ParentId | Name     | DataType |
> +----+----------+----------+----------+
> |  1 |     NULL | Patrol   |        0 |
> |  2 |        1 | Officer  |        0 |
> |  3 |        2 | FullName |        1 |
> |  4 |        2 | Arrests  |        2 |
> 
> ... where:
> 
> DataType 0 = complex type
>          1 = string
>          2 = integer
> 
> Here is an excerpt the element node table, representing instance data:
> 
> Table: NodeElement
> +----+----------+----------+--------------+
> | Id | ParentId | Name     | TypedValueId |
> +----+----------+----------+--------------+
> |  1 |     NULL | Patrol   |         NULL |
> |  2 |        1 | Officer  |         NULL |
> |  3 |        2 | FullName |            1 |
> |  4 |        2 | FullName |            2 |
> |  5 |        1 | Arrests  |            1 |
> ...
> 
> Typed-value for complex types is NULL.
> 
> I also have separate tables for each datatype, as follows:
> 
> Table: TypedValueInt
> +----+-------+
> | Id | Value |
> +----+-------+
> |  1 |    17 |
> ...
> 
> Table: TypedValueText
> +----+---------------+
> | Id | Value         |
> +----+---------------+
> |  1 | Carey Mahoney |
> |  2 | Larvell Jones |
> ...
> 
> XPath expression processing is divided into two phases: compilation and
> execution. As an example, consider the following expression:
> 
> /Patrol[Arrests>10]/Officer
> 
> The compiler does the schema validation by executing SQL statements
> against the schema data in the underlying database, testing that the
> structure is valid and retrieving the datatypes for values needed in
> comparisons (in this example, datatype for the Arrests element is
> retrieved). In addition, the compilation base generates the SQL for the
> execution phase.
> 
> After the compilation succeeds, the expression can be executed. Suppose
> something like the following is generated by the compiler (this may or may
> not be correct, but I know you get the idea):
> 
> SELECT Id
> FROM NodeElement
> WHERE Name = 'Officer'
> AND ParentId IN (
>     SELECT Id
>     FROM NodeElement
>     WHERE Name = 'Patrol'
>     AND Id IN (
>         SELECT ParentId
>         FROM NodeElement e, TypedValueInt i
>         WHERE e.Name = 'Arrests'
>         AND e.TypedValueId = i.Id
>         AND i.Value > 10
>     )
> )
> 
> As you can see, the expressions are processed against the schema data
> (compilation phase) or instance data (execution phase) in the underlying
> relational database.
> 
> So far so good. But then, as the number of document instances increases in
> the database, the performance degrades linearly, if not worse. This is
> where I guess some kind of indexing solution is essential.
> 
> What kind of efficient indexing solutions are there, considering this kind
> of native XML database? Structural indexing would be the most important
> issue I suppose, followed by value indexing and perhaps also full-text
> indexing.
> 
> Any help is appreciated.
> 
> --
> Regards,
> 
> Timo
> 
> 
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
> 
> The list archives are at http://lists.xml.org/archives/xml-dev/
> 
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://www.oasis-open.org/mlmanage/index.php>





 

News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 2001 XML.org. This site is hosted by OASIS