xml-dev - RE: need for defining standard APIs for xml storage


From: "Didier PH Martin" <martind@netfolder.com>
To: "Dongwook Shin" <dwshin@nlm.nih.gov>, <xml-dev@xml.org>
Date: Fri, 31 Mar 2000 15:16:57 -0500

Hi Dongwook

Dongwook said:
I have been developing XML indexing and retrieval engines that can
scale up to large XML collections. I also see that similar engines
that only create a DOM and search within it fail to scale
to large collections.

Every time I develop XML IR systems, I need an API for XML storage.
The only time I want to invoke the DOM is in indexing, which is usually
performed off-line. In retrieval, I do not want to rely on the DOM,
since it may consume a huge amount of memory, which seems to be a crucial
factor in degrading retrieval performance. Instead, I want to use a
lightweight index that maps elements to real data. To create such an
index without depending on specific repositories, it seems important
to have a well-defined API for XML storage.

Didier replies:
The recent posting about XML queries made me think a bit more about the
subject. Concretely, I think that if the DOM were augmented with a function
like
node-set = selectNodes (queryType, Expression)
where the query type could be, for instance, "XPath" or "XQL" or whatever,
and the expression is a string representing the query, we would have a
useful construct.

It also seems that we need two kinds of queries:
a) queries based on the elements.
b) queries based on the data content.

The second kind is needed when not all information is tagged. In that case
the knowledge is stored in the data content but is not tagged, and therefore
needs to be indexed to be easily retrieved.

So, if the DOM included a function such as: node-set = selectNodes
(queryType, Expression), then we could apply any kind of query to an
information set without having to add a new function each time we add a new
query type.
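
To make the idea concrete, here is a minimal sketch in Python, not a proposal
for the actual DOM binding. The names select_nodes, register_query_type and
the "content" query type are hypothetical. The "XPath" engine relies on the
limited XPath subset of the standard xml.etree.ElementTree module, and the
"content" engine is a naive substring scan standing in for a real text index.

```python
import xml.etree.ElementTree as ET

# Registry of query engines, keyed by query type (hypothetical design).
_QUERY_ENGINES = {}


def register_query_type(query_type, engine):
    """Register a callable engine(root, expression) -> iterable of elements."""
    _QUERY_ENGINES[query_type.lower()] = engine


def select_nodes(root, query_type, expression):
    """Single entry point: dispatch the expression to the engine for query_type."""
    try:
        engine = _QUERY_ENGINES[query_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported query type: {query_type!r}") from None
    return list(engine(root, expression))


def _xpath_engine(root, expression):
    # Query based on the elements. ElementTree supports only an XPath subset.
    return root.findall(expression)


def _content_engine(root, expression):
    # Query based on the data content: elements whose own text contains the
    # expression (case-insensitive). A real engine would consult an index.
    needle = expression.lower()
    return [e for e in root.iter() if e.text and needle in e.text.lower()]


register_query_type("XPath", _xpath_engine)
register_query_type("content", _content_engine)

doc = ET.fromstring(
    "<library><book><title>Relativity</title>"
    "<note>Written by Einstein</note></book></library>"
)
by_structure = select_nodes(doc, "XPath", "./book/title")
by_content = select_nodes(doc, "content", "einstein")
```

Adding a new query language then means one more register_query_type call;
the signature of select_nodes does not change.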

Now, about your indexes: what kind of algorithm are you using for the
elements?

On our side, as we get more experimental data, we are moving toward a world
where the permanent information set uses some of the grove concepts. We now
have the right element for this: the xinclude:include element. If a data
source somewhere can take a URL as a request and return an XML document
fragment, then even a big collection can be managed by all kinds of tools.
I'll explain it more, be patient....

a) imagine that you have an information set where a big chunk of it is
stored somewhere else and, moreover, is created dynamically. To express
this, we have a document such as:

<mydocument>
  <element1>
           ....
  </element1>
  <element2>
      <xinclude:include href="http://myfavoritesqlserver.com/sql=select
name, address, profile from customerDB where profile=good-customer"/>
  </element2>

  etc....
</mydocument>

b) imagine now that this document is stored in a permanent information set
(or GROVE if you wish). We only store the xinclude element in the permanent
information set. This element is used as a kind of external link.

c) a user requests an XPath like "/mydocument/element2[name='albert Einstein']".
Then the information set engine talks to the SQL server with the SQL
request. The SQL engine uses its set of B+* trees to retrieve the information
and returns an XML document. From this document we continue to resolve the
XPath expression until we get the right "albert Einstein".

d) you can also imagine the same scenario with a different query language,
like XQL, which should allow you to select a range.
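
Steps a) to c) can be sketched roughly as follows in Python. Everything
here is an assumption made for illustration: the XInclude namespace URI,
the fake_sql_data_source function that stands in for the SQL server, the
invented sample rows, and the <customers>/<customer> shape of the fragment
it returns. The query uses the XPath subset of xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI bound to the xinclude: prefix.
XI_NS = "http://www.w3.org/2001/XInclude"
XI_INCLUDE = f"{{{XI_NS}}}include"

# What the permanent information set stores: only the include element.
STORED_DOCUMENT = f"""
<mydocument xmlns:xinclude="{XI_NS}">
  <element1>....</element1>
  <element2>
    <xinclude:include href="http://myfavoritesqlserver.com/sql=select name, address, profile from customerDB where profile=good-customer"/>
  </element2>
</mydocument>
""".strip()


def fake_sql_data_source(href):
    """Hypothetical stand-in for the SQL server: URL in, XML fragment out."""
    rows = [  # invented sample rows, not real data
        ("Albert Einstein", "Princeton"),
        ("Marie Curie", "Paris"),
    ]
    fragment = ET.Element("customers")
    for name, address in rows:
        customer = ET.SubElement(fragment, "customer")
        ET.SubElement(customer, "name").text = name
        ET.SubElement(customer, "address").text = address
    return fragment


def resolve_includes(elem, fetch):
    """Replace every include child, at any depth, with the fetched fragment."""
    for index, child in enumerate(list(elem)):
        if child.tag == XI_INCLUDE:
            fragment = fetch(child.get("href"))
            fragment.tail = child.tail
            elem[index] = fragment
        else:
            resolve_includes(child, fetch)


def select(document_text, path, fetch=fake_sql_data_source):
    """Resolve the external links, then continue resolving the path."""
    root = ET.fromstring(document_text)
    resolve_includes(root, fetch)
    return root.findall(path)


matches = select(
    STORED_DOCUMENT,
    "./element2/customers/customer[name='Albert Einstein']",
)
```

This sketch resolves every include before querying. An engine for large
collections would more likely resolve only the includes that lie on the
path being evaluated.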

Now, you can also imagine that one of the nodes represents a topic and that
this topic is a keyword located in different documents.

You can as well choose to have an xlink element to point to an XML fragment
instead of an xinclude:include element.

The point to note here is that a permanent information set can have its
content in diverse forms but still show an XML face. So, structured and
unstructured content can be freely intermixed. B+* trees can be used to
retrieve structured content and text indexes to retrieve unstructured
content. So, as soon as we talk about XML storage, it is better to have an
engine that can wrap this back-end diversity, to hide the fact that the XML
hierarchy does not necessarily come from a text document, and finally to
resolve this diversity into a small set of data types like nodes,
node-sets, etc...
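
As a rough illustration of such a wrapping engine, the Python sketch below
mounts two different back ends behind one node interface. The Node and
InformationSet classes, the mount method, and both back-end functions are
hypothetical, and the rows and notes are invented. A plain list of dicts
stands in for a B+* tree indexed table and a list of strings for indexed
free text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """The one data type every back end must produce."""
    name: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


class InformationSet:
    """Facade showing a single hierarchy over heterogeneous back ends."""

    def __init__(self) -> None:
        self._backends: Dict[str, Callable[[], List[Node]]] = {}

    def mount(self, name: str, backend: Callable[[], List[Node]]) -> None:
        # Each mounted name becomes a child of the root node.
        self._backends[name] = backend

    def root(self) -> Node:
        # Back ends are called only when the hierarchy is requested.
        return Node(
            "root",
            children=[Node(n, children=b()) for n, b in self._backends.items()],
        )


def structured_backend() -> List[Node]:
    # Stand-in for a table reached through B+* trees (invented rows).
    rows = [{"name": "Albert Einstein", "profile": "good-customer"}]
    return [
        Node("customer", children=[Node(k, v) for k, v in row.items()])
        for row in rows
    ]


def unstructured_backend() -> List[Node]:
    # Stand-in for free text reached through a text index (invented notes).
    notes = ["Called about delivery dates."]
    return [Node("note", text) for text in notes]


info = InformationSet()
info.mount("customers", structured_backend)
info.mount("notes", unstructured_backend)
tree = info.root()
```

A caller walking tree sees only nodes and lists of nodes, and cannot tell
which children came from the table and which from the text store.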

In fact, all this stuff is what the GROVE is about. Now the DOM should
evolve to be not only an interface to parsed text documents but also an
interface to information sets. An information set does not necessarily come
from a text document. In fact, information sets could be the latest
incarnation of hierarchical databases, or the latest incarnation of an
aggregation tool. We are slowly evolving toward that goal.

Cheers




