xml-dev - Indexing solution for native XML database

Indexing solution for native XML database

[ Lists Home | Date Index | Thread Index ]

To: xml-dev@lists.xml.org
Subject: Indexing solution for native XML database
From: Timo Hildén <castaway@daug.net>
Date: Mon, 28 Nov 2005 10:15:54 +0200 (EET)

Hello all,

I'm searching for an indexing solution for my native XML database project,
which I'm writing as a learning project.

I use C++ as development language and a relational database as backend.
I'm using a data model based on the XQuery 1.0 / XPath 2.0 Data Model
(W3C), and I have tables in the relational database corresponding to the
seven node kinds mentioned in that specification. I've implemented simple
XPath support and everything seems to work fine so far, but performance is
a gruesome problem. The more data I store into the database, the slower
expression processing gets. No surprise there.

The expressions are processed using the schema and instance data in the
underlying relational database in compilation and execution phases,
respectively. I think I've indexed the columns in the relational database
as optimally as possible, so no further speedups can be achieved at that
level.

So, some sort of indexing scheme is clearly required at the data model
level. Frankly, I'm quite unfamiliar with indexing when it comes to native
XML databases, especially cases that use a relational database as backend,
as mine does. I'd like to have some kind of automatic structural indexing
at least.

I'll provide some background information about my project, simplified for
clarity. Let's say I have the following schema:

<xs:element name="Patrol" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Officer" maxOccurs="2">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="FullName" type="xs:string" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="Arrests" type="xs:integer" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

I also have several instance documents based upon it:

<Patrol>
  <Officer>
    <FullName>Carey Mahoney</FullName>
  </Officer>
  <Officer>
    <FullName>Larvell Jones</FullName>
  </Officer>
  <Arrests>17</Arrests>
</Patrol>

... etc ...

Here is an excerpt of the SchemaNodeElement table that represents schema
information of the elements in the data model:

Table: SchemaNodeElement
+----+----------+----------+----------+
| Id | ParentId | Name     | DataType |
+----+----------+----------+----------+
|  1 |     NULL | Patrol   |        0 |
|  2 |        1 | Officer  |        0 |
|  3 |        2 | FullName |        1 |
|  4 |        2 | Arrests  |        2 |

... where:

DataType 0 = complex type
         1 = string
         2 = integer

Here is an excerpt the element node table, representing instance data:

Table: NodeElement
+----+----------+----------+--------------+
| Id | ParentId | Name     | TypedValueId |
+----+----------+----------+--------------+
|  1 |     NULL | Patrol   |         NULL |
|  2 |        1 | Officer  |         NULL |
|  3 |        2 | FullName |            1 |
|  4 |        2 | FullName |            2 |
|  5 |        1 | Arrests  |            1 |
...

Typed-value for complex types is NULL.

I also have separate tables for each datatype, as follows:

Table: TypedValueInt
+----+-------+
| Id | Value |
+----+-------+
|  1 |    17 |
...

Table: TypedValueText
+----+---------------+
| Id | Value         |
+----+---------------+
|  1 | Carey Mahoney |
|  2 | Larvell Jones |
...

XPath expression processing is divided into two phases: compilation and
execution. As an example, consider the following expression:

/Patrol[Arrests>10]/Officer

The compiler does the schema validation by executing SQL statements
against the schema data in the underlying database, testing that the
structure is valid and retrieving the datatypes for values needed in
comparisons (in this example, datatype for the Arrests element is
retrieved). In addition, the compilation base generates the SQL for the
execution phase.

After the compilation succeeds, the expression can be executed. Suppose
something like the following is generated by the compiler (this may or may
not be correct, but I know you get the idea):

SELECT Id
FROM NodeElement
WHERE Name = 'Officer'
AND ParentId IN (
    SELECT Id
    FROM NodeElement
    WHERE Name = 'Patrol'
    AND Id IN (
        SELECT ParentId
        FROM NodeElement e, TypedValueInt i
        WHERE e.Name = 'Arrests'
        AND e.TypedValueId = i.Id
        AND i.Value > 10
    )
)

As you can see, the expressions are processed against the schema data
(compilation phase) or instance data (execution phase) in the underlying
relational database.

So far so good. But then, as the number of document instances increases in
the database, the performance degrades linearly, if not worse. This is
where I guess some kind of indexing solution is essential.

What kind of efficient indexing solutions are there, considering this kind
of native XML database? Structural indexing would be the most important
issue I suppose, followed by value indexing and perhaps also full-text
indexing.

Any help is appreciated.

-- 
Regards,

Timo

Follow-Ups:
- RE: [xml-dev] Indexing solution for native XML database
  - From: "Michael Kay" <mike@saxonica.com>

Prev by Date: RE: [xml-dev] combining XML documents
Next by Date: RE: [xml-dev] Problems with xs:redefine
Previous by thread: Fw: [xml-dev] should redirect
Next by thread: RE: [xml-dev] Indexing solution for native XML database
Index(es):
- Date
- Thread